• Machine Learning (some parts still need updating)

    Linear regression/prediction

    image-20230505160733435

    Formulate the loss function $J(\theta)$ and find the parameters $\theta$ from the data $(x, y)$.

    Matrix form: $\hat{y} = X\theta$. Each $x$ has $d$-dimensional features; in total there are $n$ data pairs $(y_i, x_i)$.

    image-20230505160748702

    Assume each $y_i$ is Gaussian distributed with mean $x_i^T\theta$ and variance $\sigma^2$:

    $y_i \sim N(x_i^T\theta, \sigma^2) = x_i^T\theta + N(0, \sigma^2)$

    image-20230505193953662

    the likelihood for the linear regression will be

    $p(y|X,\theta,\sigma) = \prod_{i=1}^n p(y_i|x_i,\theta,\sigma) = \prod_{i=1}^n (2\pi\sigma^2)^{-1/2} e^{-\frac{1}{2\sigma^2}(y_i - x_i^T\theta)^2} = (2\pi\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T\theta)^2}$

    or, using the log likelihood, $L(\theta) = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}(y - X\theta)^T(y - X\theta)$

    $\frac{\partial L(\theta)}{\partial \theta} = 0 \Rightarrow -\frac{1}{2\sigma^2}\left[0 - 2X^Ty + 2X^TX\theta\right] = 0$

    so we get $\hat{\theta} = (X^TX)^{-1}X^Ty$, and with regularization $\hat{\theta} = (X^TX + \delta^2 I_d)^{-1}X^Ty$

    Regularization term $\delta^2\theta^T\theta$:

    $\hat{\theta} = (X^TX + \delta^2 I_d)^{-1}X^Ty$
    $J(\theta) = (y - X\theta)^T(y - X\theta) + \delta^2\theta^T\theta$
    $\frac{\partial J(\theta)}{\partial \theta} = 2X^TX\theta - 2X^Ty + 2\delta^2 I\theta$
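As a quick numerical check of the closed-form solutions above, here is a minimal NumPy sketch (the data, true parameters, and $\delta^2$ value are made up for illustration; setting $\delta^2 = 0$ recovers the ML/OLS estimate):

```python
import numpy as np

# Hypothetical toy data: n=50 points, d=2 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)

d = X.shape[1]
delta2 = 0.5  # regularization strength delta^2 (illustrative value)

# theta_hat = (X^T X + delta^2 I_d)^{-1} X^T y  (ridge)
theta_ridge = np.linalg.solve(X.T @ X + delta2 * np.eye(d), X.T @ y)
# delta^2 = 0 gives the maximum-likelihood / OLS solution
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
```

Ridge shrinks the estimate toward zero relative to the ML solution, which is visible by comparing the two norms.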

    image-20230506124401185

    the optimal point lies on the blue curve between $\hat{\theta}_{ML}$ and $\hat{\theta}_R$ (as $\delta^2$ varies)

    Extend to nonlinear via basis function

    $y(x) = \phi(x)\theta + \epsilon$, using $\phi(x)$ to deal with the nonlinearity

    $\hat{\theta}_{ML} = [\phi(x)^T\phi(x)]^{-1}\phi(x)^Ty$

    image-20230506130455991

    image-20230506130654112

    The regularization will drive some components of $\theta$ toward zero.

    Kernel regression and RBF (with same variance)

    $\phi(x) = [\kappa(x,\mu_1,\lambda), \ldots, \kappa(x,\mu_d,\lambda)]$, e.g. $\kappa(x,\mu_i,\lambda) = e^{-\frac{1}{\lambda}\|x-\mu_i\|^2}$, so $\hat{y}(x_i) = \phi(x_i)\theta = 1\cdot\theta_0 + \kappa(x_i,\mu_1,\lambda)\theta_1 + \cdots + \kappa(x_i,\mu_d,\lambda)\theta_d$

    image-20230506132230224

    Example:

    $\phi(x_i) = [1,\ \kappa(x_i,\mu_1,\lambda),\ \kappa(x_i,\mu_2,\lambda),\ \kappa(x_i,\mu_3,\lambda)]$

    $\phi(x_i)$ is a vector with 4 entries. There are 3 bases. The corresponding vector of parameters is $\theta = [\theta_0\ \theta_1\ \theta_2\ \theta_3]^T$

    $\hat{y}_i = \phi(x_i)\theta$

    If we have $i = 1,\ldots,N$ data points, let

    $Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad \Phi = \begin{bmatrix} \phi(x_1) \\ \phi(x_2) \\ \vdots \\ \phi(x_N) \end{bmatrix}$

    Then

    $\hat{Y} = \Phi\theta$ and $\hat{\theta}_{ls} = (\Phi^T\Phi)^{-1}\Phi^TY$

    or $\hat{\theta}_{ridge} = (\Phi^T\Phi + \delta^2 I)^{-1}\Phi^Ty$. Hence, this is still linear regression, with $X$ replaced by $\Phi$.

    In practice, we choose the locations $\mu$ of the basis functions to be the inputs, $\mu_i = x_i$, but then we still need to choose $\lambda$.
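A minimal sketch of the RBF-basis construction above, with centers at the inputs $\mu_i = x_i$ (the data, bandwidth $\lambda$, and ridge $\delta^2$ are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)                      # 1-d inputs
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

lam = 0.05   # bandwidth lambda (must still be chosen by hand / CV)
mu = x       # centers placed at the inputs, mu_i = x_i

def phi(xq):
    """Feature map [1, k(x, mu_1), ..., k(x, mu_n)], k = exp(-||x - mu||^2 / lam)."""
    k = np.exp(-(xq[:, None] - mu[None, :]) ** 2 / lam)
    return np.hstack([np.ones((len(xq), 1)), k])

Phi = phi(x)
delta2 = 1e-3
theta = np.linalg.solve(Phi.T @ Phi + delta2 * np.eye(Phi.shape[1]), Phi.T @ y)
y_hat = phi(x) @ theta
```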

    Using Cross-validation to get the regularization parameters δ2

    1. Given training data $(x, y)$ and some guess of $\delta^2$, compute $\hat{\theta}$

    2. $\hat{y}_{train} = X_{train}\hat{\theta}$

    3. $\hat{y}_{test} = X_{test}\hat{\theta}$, or do this on a validation data set

    so the parameter $\delta^2$ can then be picked in different ways - average or min-max
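The steps above can be sketched as a simple grid search over $\delta^2$ with a train/validation split (the data, grid values, and split are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=80)

# simple train/validation split
Xtr, ytr = X[:60], y[:60]
Xva, yva = X[60:], y[60:]

def ridge_fit(X, y, delta2):
    return np.linalg.solve(X.T @ X + delta2 * np.eye(X.shape[1]), X.T @ y)

# steps 1-3: for each delta^2 guess, fit on train, score on validation
grid = [1e-4, 1e-2, 1.0, 100.0]
val_err = {d2: np.mean((yva - Xva @ ridge_fit(Xtr, ytr, d2)) ** 2) for d2 in grid}
best_delta2 = min(val_err, key=val_err.get)
```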

    image-20230506201640724

    image-20230506201845243

    image-20230506202305896

    Bayesian learning

    image-20230521171459119

    Given a dataset $D = \{x_1, \ldots, x_n\}$: here each $x_i$ is one random variable. Bayes Rule:

    $P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}$, where $p(D|\theta)$ is the likelihood function of $\theta$, $P(\theta)$ is the prior probability of $\theta$, $P(\theta|D)$ is the posterior distribution over $\theta$, and $p(D)$ is the marginal likelihood

    Computing posterior distribution is known as the inference problem. But:

    $P(D) = \int P(D,\theta)\,d\theta$, or in discrete hypothesis-space notation, $p(d) = \sum_{h\in H} p(d|h)p(h)$

    This integral can be very high-dimensional and difficult to compute.

    $p(h|d) = \frac{p(d|h)p(h)}{\sum_{h'\in H} p(d|h')p(h')}$

    Model parameters:

    image-20230506205856792

    Treat the parameters $\theta$ as uncertain; we only specify their prior distribution (prior knowledge).

    Learning material

    Lecture 7. Bayesian Learning — ML Engineering (ml-course.github.io)

    Gaussian processes (3/3) - exploring kernels (peterroelants.github.io)

    [Optimizing Parameters (mlatcl.github.io)](https://mlatcl.github.io/gpss/lectures/04-optimizing-parameters.html)

    Bayesian linear regression

    Understanding the training data $D = (X, y)$: $(X, y)$ can be thought of as a coordinate system, and $D$ is the random variable. (In GPR, $y$ is a function of $x$; $x$ is not a random variable, $y$ is the random variable.)

    With a Gaussian likelihood $N(y|X\theta, \sigma^2 I_n)$, the conjugate prior is also Gaussian, which we will denote by $p(\theta) = N(\theta|\theta_0, V_0)$ (often just an initial guess). Using Bayes rule for Gaussians, the posterior is given by

    $p(\theta|X, y, \sigma^2) \propto p(y|X, \theta, \sigma^2)\,p(\theta) = N(\theta|\theta_0, V_0)\,N(y|X\theta, \sigma^2 I_n) = N(\theta|\theta_n, V_n)$, with $\sigma^2$ given,
    $\theta_n = V_n V_0^{-1}\theta_0 + \frac{1}{\sigma^2}V_n X^T y$
    $V_n^{-1} = V_0^{-1} + \frac{1}{\sigma^2}X^T X$. **Note: $V_n$ is the posterior covariance of $\theta$.**

    Here is the calculation:

    $p(\theta|X,y,\sigma^2) \propto e^{-\frac{1}{2}(y-X\theta)^T(\sigma^2 I)^{-1}(y-X\theta)}\, e^{-\frac{1}{2}(\theta-\theta_0)^T V_0^{-1}(\theta-\theta_0)}$
    $= e^{-\frac{1}{2}\{(y-X\theta)^T(\sigma^2 I)^{-1}(y-X\theta) + (\theta-\theta_0)^T V_0^{-1}(\theta-\theta_0)\}}$
    $= e^{-\frac{1}{2}\{y^T(\sigma^2 I)^{-1}y - 2y^T(\sigma^2 I)^{-1}X\theta + \theta^T X^T(\sigma^2 I)^{-1}X\theta + \theta^T V_0^{-1}\theta - 2\theta_0^T V_0^{-1}\theta + \theta_0^T V_0^{-1}\theta_0\}}$
    $= e^{-\frac{1}{2}\{\text{const} + \theta^T \underbrace{(X^T(\sigma^2 I)^{-1}X + V_0^{-1})}_{\text{call this } V_n^{-1}}\,\theta - 2(y^T(\sigma^2 I)^{-1}X + \theta_0^T V_0^{-1})\theta\}}$
    $= e^{-\frac{1}{2}\{\text{const} + \theta^T V_n^{-1}\theta - 2(\frac{y^T X}{\sigma^2} + \theta_0^T V_0^{-1})\theta\}}$
    $= e^{-\frac{1}{2}\{\text{const} + \theta^T V_n^{-1}\theta - 2\theta_n^T V_n^{-1}\theta + 2\theta_n^T V_n^{-1}\theta - 2(\frac{y^T X}{\sigma^2} + \theta_0^T V_0^{-1})\theta\}}$
    $= e^{-\frac{1}{2}\{\text{const}_2 + (\theta-\theta_n)^T V_n^{-1}(\theta-\theta_n) + 2[\theta_n^T V_n^{-1} - \frac{y^T X}{\sigma^2} - \theta_0^T V_0^{-1}]\theta\}}$

    In the above steps we get $V_n^{-1} = V_0^{-1} + \frac{1}{\sigma^2}X^TX$. To make $\theta_n^T V_n^{-1} - \frac{y^T X}{\sigma^2} - \theta_0^T V_0^{-1} = 0$, we set $\theta_n = V_n[V_0^{-1}\theta_0 + \frac{X^T y}{\sigma^2}]$, which gets rid of the last term in the above equation. We also know the Gaussian integral (**normalization constant**) as below:

    $\int e^{-\frac{1}{2}(\theta-\theta_n)^T V_n^{-1}(\theta-\theta_n)}\,d\theta = |2\pi V_n|^{1/2}$

    The posterior is $p(\theta|X,y,\sigma^2) = |2\pi V_n|^{-1/2}\,e^{-\frac{1}{2}(\theta-\theta_n)^T V_n^{-1}(\theta-\theta_n)}$, so $\text{const}_2 = \log|2\pi V_n|$?

    When $\theta_0 = 0$ and $V_0 = \tau^2 I_d$, which means the Gaussian prior is spherical, $\theta_n = V_n\frac{X^Ty}{\sigma^2} = \frac{1}{\sigma^2}V_n X^T y = (\lambda I_d + X^TX)^{-1}X^Ty$ with $\lambda = \frac{\sigma^2}{\tau^2}$: this is ridge regression.

    If $\tau \to \infty$ ($\tau^2$ is the prior Gaussian variance; the prior becomes flat, like a uniform distribution, i.e. we don't know the prior), then $\lambda \to 0$ and we recover maximum likelihood.
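A minimal sketch of the posterior computation with the spherical prior above, checking numerically that it coincides with ridge regression with $\lambda = \sigma^2/\tau^2$ (all data and hyperparameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
theta_true = np.array([1.0, -1.0, 0.5])
sigma2 = 0.25
y = X @ theta_true + np.sqrt(sigma2) * rng.normal(size=40)

tau2 = 4.0                      # spherical prior: theta_0 = 0, V_0 = tau^2 I
d = X.shape[1]
V0_inv = np.eye(d) / tau2

# V_n^{-1} = V_0^{-1} + X^T X / sigma^2 ;  theta_n = V_n (V_0^{-1} theta_0 + X^T y / sigma^2)
Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)
theta_n = Vn @ (X.T @ y / sigma2)          # theta_0 = 0

# equivalence with ridge: lambda = sigma^2 / tau^2
lam = sigma2 / tau2
theta_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```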

    Theorem 4.4.1 (Bayes rule for linear Gaussian systems) (KPM)

    Given the likelihood $p(y|x)$ and the prior distribution $p(x)$ (in Bayesian learning, $x$ plays the role of the parameters and $y$ of the data):

    $p(x) = N(x|\mu_x, \Sigma_x), \quad p(y|x) = N(y|Ax + b, \Sigma_y) \tag{4.124}$

    Theorem 4.4.1 (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, as in Equation 4.124, the posterior p(x|y) is given by the following:

    posterior distribution

    $p(x|y) = N(x|\mu_{x|y}, \Sigma_{x|y}), \quad \Sigma_{x|y}^{-1} = \Sigma_x^{-1} + A^T\Sigma_y^{-1}A, \quad \mu_{x|y} = \Sigma_{x|y}\left[A^T\Sigma_y^{-1}(y-b) + \Sigma_x^{-1}\mu_x\right] \tag{4.125}$

    After learning, the predictive distribution (a convolution of two Gaussians) is $p(y) = \int p(y|x,\theta)p(\theta)\,d\theta$:

    $p(y) = N(y|A\mu_x + b,\ \Sigma_y + A\Sigma_x A^T) \tag{4.126}$

    Besides the above proof, please also refer to chapter 4.4.3 (KPM), which utilizes the following:

    where the precision matrix of the joint is defined as

    $\Sigma^{-1} = \begin{pmatrix} \Sigma_x^{-1} + A^T\Sigma_y^{-1}A & -A^T\Sigma_y^{-1} \\ -\Sigma_y^{-1}A & \Sigma_y^{-1} \end{pmatrix} = \Lambda = \begin{pmatrix} \Lambda_{xx} & \Lambda_{xy} \\ \Lambda_{yx} & \Lambda_{yy} \end{pmatrix}$

    From Equation 4.69 and using the fact that μy=Aμx+b, we have

    $p(x|y) = N(\mu_{x|y}, \Sigma_{x|y}), \quad \Sigma_{x|y} = \Lambda_{xx}^{-1} = (\Sigma_x^{-1} + A^T\Sigma_y^{-1}A)^{-1}, \quad \mu_{x|y} = \Sigma_{x|y}(\Lambda_{xx}\mu_x - \Lambda_{xy}(y - \mu_y)) = \Sigma_{x|y}(\Sigma_x^{-1}\mu_x + A^T\Sigma_y^{-1}(y - b))$

    Naive Bayesian learning

    $P(c|x) = \frac{P(x|c)P(c)}{P(x)}$


    Bayesian prediction Vs ML prediction

    Posterior mean: $\theta_n = (\lambda I_d + X^TX)^{-1}X^Ty$. Posterior variance: $V_n = \sigma^2(\lambda I_d + X^TX)^{-1}$

    Given the training data $D = (X, y)$. To predict, Bayesians marginalize over the posterior distribution of $\theta$; $x$ is the new input:

    $P(y|x, D, \sigma^2) = \int N(y|x^T\theta, \sigma^2)\,N(\theta|\theta_n, V_n)\,d\theta = N(y|x^T\theta_n,\ \sigma^2 + x^T V_n x)$

    The ML predictor replaces the posterior with a point mass $\delta_{ML}(\theta)$, believing there is only one true $\theta_{ML}$:

    $P(y|x, D, \sigma^2) = \int N(y|x^T\theta, \sigma^2)\,\delta_{ML}(\theta)\,d\theta = N(y|x^T\theta_{ML}, \sigma^2)$
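A small sketch contrasting the two predictive variances above: the Bayesian predictive adds the term $x^T V_n x$ on top of the noise $\sigma^2$, while the ML predictive keeps only $\sigma^2$ (data, $\lambda$, and the query point $x_\ast$ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
sigma2, lam = 0.1, 0.5
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=30)

d = X.shape[1]
A = lam * np.eye(d) + X.T @ X
theta_n = np.linalg.solve(A, X.T @ y)     # posterior mean
Vn = sigma2 * np.linalg.inv(A)            # posterior covariance of theta

x_star = np.array([0.3, -0.7])            # a new input (hypothetical)
mean = x_star @ theta_n
var_bayes = sigma2 + x_star @ Vn @ x_star # sigma^2 + x^T V_n x
var_ml = sigma2                           # ML ignores parameter uncertainty
```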

    $x = (x_1, x_2, \ldots, x_d)$, $D = \{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), \ldots, (X_n, Y_n)\}$

    $\mu = (\mu_1, \mu_2, \ldots, \mu_d)$, $\Sigma = \begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1d} \\ \vdots & \ddots & \vdots \\ \Sigma_{d1} & \cdots & \Sigma_{dd} \end{pmatrix}$

    Bayesian Networks(TBU)

    Bayesian optimization

    Bayesian Hyperparameter Optimization using Gaussian Processes | Brendan Hasz

    How to Implement Bayesian Optimization from Scratch in Python - MachineLearningMastery.com

    Tutorial #8: Bayesian optimization - Borealis AI


    Bayesian Optimization Algorithm:

    For t=1,2,… do

    1. Find $x_t$ by combining attributes of the posterior distribution in a utility function $u$ and maximizing: $x_t = \arg\max_x u(x|D_{1:t-1})$

    2. Sample the objective function: $y_t = f(x_t) + \epsilon_t$

    3. Augment the data $D_{1:t} = \{D_{1:t-1}, (x_t, y_t)\}$ and update the GP

    End for

    $P(y_{t+1}|D_{1:t}, x_{t+1}) = N(\mu_t(x_{t+1}),\ \sigma_t^2(x_{t+1}) + \sigma_{noise}^2)$
    $\mu_t(x_{t+1}) = k^T[K + \sigma_{noise}^2 I]^{-1} y_{1:t}$
    $\sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - k^T[K + \sigma_{noise}^2 I]^{-1} k$
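The GP predictive equations above can be sketched directly in NumPy (the objective, kernel length scale, and noise level are made-up illustrative values):

```python
import numpy as np

def rbf(a, b, ell=0.3, sf2=1.0):
    """Squared-exponential kernel matrix between 1-d point sets a and b."""
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(5)
x_obs = rng.uniform(0, 1, size=8)
f = lambda x: np.sin(3 * x)                # hypothetical objective
sigma_noise2 = 1e-4
y_obs = f(x_obs) + np.sqrt(sigma_noise2) * rng.normal(size=8)

x_new = np.linspace(0, 1, 50)
K = rbf(x_obs, x_obs) + sigma_noise2 * np.eye(8)
k = rbf(x_obs, x_new)                      # cross-covariances K[X, x_{t+1}]

# mu_t(x) = k^T [K + sig^2 I]^{-1} y ;  sig_t^2(x) = k(x,x) - k^T [K + sig^2 I]^{-1} k
alpha = np.linalg.solve(K, y_obs)
mu = k.T @ alpha
var = rbf(x_new, x_new).diagonal() - np.einsum('ij,ij->j', k, np.linalg.solve(K, k))
```

Note that the posterior variance is never larger than the prior variance `sf2`, since the subtracted quadratic form is non-negative.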

    Basically, we should choose the next point $x$ where the acquisition function is maximized.

    The acquisition function

    μ(x)+κσ(x)

    The acquisition function takes in the obtained GP fit (mean and variance at each point) and returns the hyper-parameter value for the next run of the machine learning algorithm. For more details on the derivation refer to this article. More kernel visualizations can be found here.

    Probability of Improvements

    $PI(x) = P(f(x) \geq \mu^+ + \xi) = \Phi\left(\frac{\mu(x) - \mu^+ - \xi}{\sigma(x)}\right)$

    $\Phi$ is the cumulative distribution function (CDF) of the standard Normal.

    Expected improvement (Expected utility criterion/Bayesian and decision theory)

    $EU(a) = \sum_x u(x,a)\,p(x|\text{data})$: each action is weighted by the probability.

    $a$ is an action, $u(x,a)$ is a cost/reward model. We want the expected improvement to be maximal - it quantifies the amount of the improvement.

    At iteration $n+1$, choose the point that minimizes the distance to the objective evaluated at the maximum $x^*$:

    $x_{n+1} = \arg\min_x E\left(\|f_{n+1}(x) - f(x^*)\|\,\middle|\,D_n\right) = \arg\min_x \int \|f_{n+1}(x) - f(x^*)\|\,p(f_{n+1}|D_n)\,df_{n+1}$

    we don't have the true objective , so we get expected improvement function as following:

    $x = \arg\max_x E\left(\max\{0,\ f_{n+1}(x) - f_{max}\}\,\middle|\,D_n\right)$
    $EI(x) = \begin{cases} (\mu(x) - \mu^+ - \xi)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}, \quad Z = \frac{\mu(x) - \mu^+ - \xi}{\sigma(x)}$

    where ϕ() and Φ() denote the PDF and CDF of the standard Normal
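The EI formula above, implemented directly (the GP mean/std values and incumbent $\mu^+$ below are made-up numbers, just to exercise the two branches of the piecewise definition):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, mu_best, xi=0.01):
    """EI(x) = (mu - mu+ - xi) Phi(Z) + sigma phi(Z); 0 where sigma == 0."""
    ei = np.zeros_like(mu)
    m = sigma > 0
    z = (mu[m] - mu_best - xi) / sigma[m]
    ei[m] = (mu[m] - mu_best - xi) * norm.cdf(z) + sigma[m] * norm.pdf(z)
    return ei

# hypothetical posterior at three candidate points
mu = np.array([0.0, 0.5, 1.0])
sigma = np.array([0.0, 0.5, 0.1])
ei = expected_improvement(mu, sigma, mu_best=0.9)
x_next = np.argmax(ei)   # index of the next point to evaluate
```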

    image-20230514211657419

     

    GP-UCB (Upper/lower confidence band)

    Define the regret and cumulative regret as follows:

    $r(x) = f(x^*) - f(x), \quad R_T = r(x_1) + \cdots + r(x_T)$

    The GP-UCB criterion is as follows:

    $GP\text{-}UCB(x) = \mu(x) + \sqrt{\nu\beta_t}\,\sigma(x)$

    $\beta_t$ is set using a simple concentration bound: with $\nu = 1$ and $\beta_t = 2\log(t^{d/2+2}\pi^2/3\delta)$, it can be shown with high probability that this method is no-regret, i.e. $\lim_{T\to\infty} R_T/T = 0$. This in turn implies a lower bound on the convergence rate for the optimization problem.

    image-20230514211800165

    image-20230514165712997

    Thompson sampling (TBS)

    $\alpha_{THOMP.}(x;\theta,D) = g(x)$, where $g(x)$ is sampled from $GP(\mu(x), k(x,x'))$

    image-20230514213022169

    Entropy search and predictive entropy search (TBS)

    $\alpha_{ES}(x;\theta,D) = H[p(x_{min}|D)] - E_{p(y|D,x)}\big[H[p(x_{min}|D \cup \{x,y\})]\big]$

    Bayesian Hyperparameter Optimization

    Bayesian approaches, in contrast to random or grid search, keep track of past evaluation results which they use to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function:

    $P(\text{score} \mid \text{hyperparameters})$

    In the literature, this model is called a “surrogate” for the objective function and is represented as p(y | x). The surrogate is much easier to optimize than the objective function and Bayesian methods work by finding the next set of hyperparameters to evaluate on the actual objective function by selecting hyperparameters that perform best on the surrogate function. In other words:

    1. Build a surrogate probability model of the objective function

    2. Find the hyperparameters that perform best on the surrogate -- have one best guess from the acquisition function.

    3. Apply these hyperparameters to the true objective function -- get the new loss data from loss function. (scores)

    4. Update the surrogate model incorporating the new results --- put the x (hyperparameters) and y (loss data from true object function) to fit/update the model

    5. Repeat steps 2–4 until max iterations or time is reached

    At a high level, Bayesian optimization methods are efficient because they choose the next hyperparameters in an informed manner. The basic idea is: spend a little more time selecting the next hyperparameters in order to make fewer calls to the objective function.

    Sequential Model-Based Optimization

    Sequential model-based optimization (SMBO) methods are a formalization of Bayesian optimization. "Sequential" refers to running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a probability model (surrogate).

    There are five aspects of model-based hyperparameter optimization:

    1. A domain of hyperparameters over which to search

    2. An objective function which takes in hyperparameters and outputs a score that we want to minimize (or maximize)

    3. The surrogate model of the objective function

    4. A criterion, called a selection function, for evaluating which hyperparameters to choose next from the surrogate model

    5. A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model

    There are several variants of SMBO methods that differ in steps 3–4, namely, how they build a surrogate of the objective function and the criteria used to select the next hyperparameters. Several common choices for the surrogate model are Gaussian Processes, Random Forest Regressions, and Tree Parzen Estimators (TPE) while the most common choice for step 4 is Expected Improvement. In this post, we will focus on TPE and Expected Improvement.

    Domain

    In the case of random search and grid search, the domain of hyperparameters we search is a grid.

    Objective Function -- the true objective function

    The objective function takes in hyperparameters and outputs a single real-valued score that we want to minimize (or maximize) --- loss function or maximum likelihood function. As an example, let’s consider the case of building a random forest for a regression problem. The hyperparameters we want to optimize are shown in the hyperparameter grid above and the score to minimize is the Root Mean Squared Error.

    While the objective function looks simple, it is very expensive to compute! If the objective function could be quickly calculated, then we could try every single possible hyperparameter combination (like in grid search). If we are using a simple model, a small hyperparameter grid, and a small dataset, then this might be the best way to go. However, in cases where the objective function may take hours or even days to evaluate, we want to limit calls to it.

    The entire concept of Bayesian model-based optimization is to reduce the number of times the objective function needs to be run by choosing only the most promising set of hyperparameters to evaluate based on previous calls to the evaluation function. The next set of hyperparameters are selected based on a model of the objective function called a surrogate.

    Surrogate Function (Probability Model)

    The surrogate function, also called the response surface, is the probability representation of the objective function built using previous evaluations. It is sometimes called a response surface because it is a high-dimensional mapping of hyperparameters to the probability of a score on the objective function. Below is a simple example with only two hyperparameters:


    There are several different forms of the surrogate function including Gaussian Processes and Random Forest regression. However, in this post we will focus on the Tree-structured Parzen Estimator as put forward by Bergstra et al in the paper “Algorithms for Hyper-Parameter Optimization”.

    Selection Function

    Please refer to the earlier section on the acquisition function.

    The selection function is the criteria by which the next set of hyperparameters are chosen from the surrogate function.

    Moreover, because the surrogate is just an estimate of the objective function, the selected hyperparameters may not actually yield an improvement when evaluated and the surrogate model will have to be updated. This updating is done based on the current surrogate model and the history of objective function evaluations.

    History

    Each time the algorithm proposes a new set of candidate hyperparameters, it evaluates them with the actual objective function and records the result in a pair (score, hyperparameters). These records form the history. The algorithm builds "Surrogate function" using the history to come up with a probability model of the objective function that improves with each iteration.

    Bayesian optimization with GPs in practice

    Consider how to deal with noisy observations, how to choose a kernel, how to learn the parameters of that kernel, and how to exploit parallel sampling of the function.

    In the noisy GP model, we no longer observe the function values $f[x]$ directly, but observe noisy corruptions $y[x] = f[x] + \epsilon$ of them. The joint distribution of previously observed noisy function values $y$ and a new unobserved point $f^*$ becomes:

    $\Pr\left(\begin{bmatrix} y \\ f^* \end{bmatrix}\right) = \text{Norm}\left[0,\ \begin{bmatrix} K[X,X] + \sigma_n^2 I & K[X,x^*] \\ K[x^*,X] & K[x^*,x^*] \end{bmatrix}\right], \tag{13}$

    and the conditional probability of a new point becomes:

    $\Pr(f^*|y) = \text{Norm}\left[\mu[x^*], \sigma^2[x^*]\right], \tag{14}$

    where

    $\mu[x^*] = K[x^*,X]\,(K[X,X] + \sigma_n^2 I)^{-1}y$
    $\sigma^2[x^*] = K[x^*,x^*] - K[x^*,X]\,(K[X,X] + \sigma_n^2 I)^{-1}K[X,x^*]. \tag{15}$

    Incorporating noise means that there is uncertainty about the function even where we have already sampled points, and so sampling twice at the same position or at very similar positions could be sensible.

     


    Kernel choice

    When we build the model of the function and its uncertainty, we are assuming that the function is smooth. If this was not the case, then we could say nothing at all about the function between the sampled points. The details of this smoothness assumption are embodied in the choice of kernel covariance function.

    We can visualize the covariance function by drawing samples from the Gaussian process prior. In one dimension, we do this by defining an evenly spaced set of points X=[x1,x2,,xI], drawing a sample from Norm[0,K[X,X]] and then plotting the results. In this section, we’ll consider several different choices of covariance function, and use this method to visualize each.

    Squared Exponential Kernel: include the amplitude α which controls the overall amount of variability and the length scale λ which controls the amount of smoothness:

    $k[x, x'] = \alpha^2\exp\left[-\frac{d^2}{2\lambda^2}\right],$

    where $d$ is the Euclidean distance between the points: $d = \sqrt{(x - x')^T(x - x')}$.

    Matérn kernel:

    $k[x,x'] = \alpha^2\exp\left[-\frac{d}{\lambda}\right], \tag{17}$

     

    Periodic Kernel: τ is the period of the oscillation and the other parameters have the same meanings as before.

    $k[x,x'] = \alpha^2\exp\left[-\frac{2\sin^2(\pi d/\tau)}{\lambda^2}\right], \tag{20}$

    Learning GP parameters

    1. Maximum likelihood: similar to training ML models, we can choose these parameters by maximizing the marginal likelihood (i.e., the likelihood of the data after marginalizing over the possible values of the function):

    $\Pr(y|X,\theta) = \int \Pr(y|f,X,\theta)\,\Pr(f|X,\theta)\,df = \text{Norm}_y\left[0,\ K[X,X] + \sigma_n^2 I\right], \tag{21}$

    where θ contains the unknown parameters in the kernel function and the measurement noise σn2.

    In Bayesian optimization, we are collecting the observations sequentially, and where we collect them will depend on the kernel parameters, and we would have to interleave the processes of acquiring new points and optimizing the kernel parameters.

    2. Full Bayesian approach: here we would choose a prior distribution Pr(θ)on the kernel parameters of the Gaussian process and combine this with the likelihood in equation (21) to compute the posterior. We then weight the acquisition functions according to this posterior:

    $\hat{a}[x] \propto \int a[x|\theta]\,\Pr(y|x,\theta)\,\Pr(\theta)\,d\theta. \tag{22}$

    In practice this would usually be done using a Monte Carlo approach in which the posterior is represented by a set of samples (see Snoek et al., 2012), and we sum together multiple acquisition functions derived from these kernel parameter samples (figure 9).


    Figure: Bayesian approach for handling length scale of kernel from Snoek et al., 2012. a-c) We fit the model with three different length scales and compute the acquisition function for each. d) We compute a weighted sum of these acquisition functions (black curve) where the weight is given by posterior probability of the data at each scale (see equation 22). We choose the next point by finding the maximum of this weighted function (black arrow). In this way, we approximately marginalize out the length scale.

     

    Optimization approach

    cost function

    $J(\theta) = (Y - X\theta)^T(Y - X\theta) = \sum_{i=1}^n (y_i - x_i^T\theta)^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ (the $\frac{1}{n}$ factor over the data batch is dropped here)

    Gradient, or solve this directly (set the derivative to 0; when it equals 0, we have reached the minimum):

    Non-matrix (scalar) solution:

    image-20230505160844056

    Matrix solution:

     

    $\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta}\left[Y^TY + \theta^TX^TX\theta - 2Y^TX\theta\right] = 0 + 2X^TX\theta - 2X^TY = 0$

    $\theta = (X^TX)^{-1}X^TY$

    Gradient

    image-20230505160859363

    The gradient vector of $f(\theta)$: $\nabla_\theta f(\theta) = \left[\frac{\partial f(\theta)}{\partial \theta_1}, \frac{\partial f(\theta)}{\partial \theta_2}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n}\right]^T$

    Hessian

    image-20230505194109147

    image-20230505194137307

    image-20230505194200413

    image-20230505194215095

    Online learning/Stochastic gradient descent (SGD)

    Batch: $\theta_{k+1} = \theta_k + \eta\sum_{i=1}^n x_i^T(y_i - x_i\theta_k)$

    Online: $\theta_{k+1} = \theta_k + \eta\, x_k^T(y_k - x_k\theta_k)$

    Mini-batch: $\theta_{k+1} = \theta_k + \eta\sum_{i=1}^{20} x_i^T(y_i - x_i\theta_k)$ (batch size 20 as an example)
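The online and mini-batch updates above can be sketched as follows (toy data, learning rate, epoch counts, and the batch size of 20 are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.05 * rng.normal(size=200)

eta = 1e-3

# online / SGD: one (x_k, y_k) pair per update
theta = np.zeros(3)
for epoch in range(20):
    for k in range(len(y)):
        xk = X[k]
        theta += eta * xk * (y[k] - xk @ theta)

# mini-batch: sum the gradient over batches of 20
theta_mb = np.zeros(3)
for epoch in range(200):
    for s in range(0, 200, 20):
        Xb, yb = X[s:s + 20], y[s:s + 20]
        theta_mb += eta * Xb.T @ (yb - Xb @ theta_mb)
```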

    Monte Carlo gradient estimation (MCGE)

    The most common optimization algorithms in machine learning are gradient-based. Consider a stochastic objective function $F(\theta)$ of the following form:

    $F(\theta) = \int f(x)\,p(x;\theta)\,dx = E_{p(x;\theta)}[f(x)] \tag{1}$

    To optimize this class of objectives, the core computational problem is estimating the gradient of the stochastic function $F(\theta)$, i.e.:

    $\nabla_\theta F(\theta) = \nabla_\theta E_{p(x;\theta)}[f(x)] \tag{2}$

    Stochastic gradient estimation is a core computational problem in machine learning and many other fields, for example: variational inference, a common approximate Bayesian inference method; policy gradient algorithms in reinforcement learning; and Bayesian optimization and active learning in experimental design. For expectation-form objectives, the most common approach is based on Monte Carlo sampling.

    Monte Carlo sampling (MCS)

    MCS is a classical method for approximating integrals. Problem (1) can usually be approximated with MCS as follows:

    $\bar{F}_N = \frac{1}{N}\sum_{n=1}^N f(\hat{x}^{(n)}), \quad \hat{x}^{(n)} \sim p(x;\theta) \tag{3}$

    Here $\hat{x}^{(n)}$ is sampled from the distribution $p(x;\theta)$. Because the sampling is random and finite, $\bar{F}_N$ is a random variable; equation (3) is the Monte Carlo estimator (MCE) of equation (1). This class of methods is very well suited to integrals of the form (1), especially when the distribution $p(x;\theta)$ is easy to sample from.
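A tiny sketch of the Monte Carlo estimator (3), using an expectation with a known closed form so the estimate can be checked (the choice $f(x) = x^2$ with $p(x;\theta) = N(\theta, 1)$ is just an illustrative example):

```python
import numpy as np

# F(theta) = E_{x ~ N(theta, 1)}[x^2] = theta^2 + 1  (closed form for checking)
theta = 2.0
rng = np.random.default_rng(7)

N = 200_000
x = rng.normal(theta, 1.0, size=N)   # x^(n) ~ p(x; theta)
F_bar = np.mean(x ** 2)              # Monte Carlo estimator F_bar_N of the expectation
```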

    The least squares

    estimates: $\hat{\theta} = (X^TX)^{-1}X^Ty$

    ERM, Empirical risk minimization

    image-20230505235902121

    image-20230505235911011

    Univariate/Multivariate Gaussian distribution

    Reference material

    https://gaussianprocess.org/gpml/

    https://zhuanlan.zhihu.com/p/31427491

    https://www.cnblogs.com/tornadomeet/archive/2013/06/15/3137239.html

    https://sandipanweb.wordpress.com/2020/12/08/gaussian-process-regression-with-python/

    Univariate Gaussian distribution probability density function(pdf)

    $p(x) = (2\pi\sigma^2)^{-1/2}\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$, $x \sim N(\mu, \sigma^2)$

    In Python, sampling the normal distribution looks like: x = sigma * np.random.randn(10) + mu

    To sample from a distribution: generate random numbers from a uniform distribution and map them through the inverse of the Cumulative Distribution Function (CDF) to get the samples (inverse-transform sampling).

    image-20230505135313111

    Multivariate Gaussian distribution

    $X = (x_1, x_2, \ldots)^T$ is a vector of multiple random variables. The pdf of an $n$-dimensional Gaussian is: $p(X) = |2\pi\Sigma|^{-1/2}\,e^{-\frac{1}{2}(X-\mu)^T\Sigma^{-1}(X-\mu)}$

    $\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} = \begin{pmatrix} E(x_1) \\ \vdots \\ E(x_n) \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{pmatrix} = E\left[(X-\mu)(X-\mu)^T\right]$

    In the case of two independent univariate Gaussian variables, $x_1 \sim N(\mu_1, \sigma^2)$, $x_2 \sim N(\mu_2, \sigma^2)$,

    their joint distribution is

    $p(x_1,x_2) = p(x_1)p(x_2) = (2\pi\sigma^2)^{-1}\,e^{-\frac{1}{2}[(x_1-\mu_1),\,(x_2-\mu_2)]\begin{bmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{bmatrix}^{-1}\begin{bmatrix}x_1-\mu_1\\ x_2-\mu_2\end{bmatrix}}$

    with $\Sigma = \begin{bmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{bmatrix}$ and $\mu = \begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix}$

    $X \sim N(\mu, \Sigma)$, with $\Sigma = BB^T$ (Cholesky rule), $X = \mu + B\,N(0, I)$

    Theorem 4.3.1 (marginal and conditionals of an MVN)

    Theorem 4.3.1 (marginals and conditionals of an MVN): suppose $X = (X_1, X_2)$ is jointly Gaussian with parameters

    $\mu = \begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \quad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}, \quad \Lambda = \Sigma^{-1} = \begin{pmatrix}\Lambda_{11} & \Lambda_{12}\\ \Lambda_{21} & \Lambda_{22}\end{pmatrix}$

    then the marginals are given by

    p(x1)=N(x1|μ1, Σ11)

    p(x2)=N(x2|μ2, Σ22)

    and the posterior conditional distribution is given by (refer to KPM to know following equations detailly)

    $p(x_1|x_2) = N(x_1|\mu_{1|2}, \Sigma_{1|2}) \tag{4.69}$
    $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2-\mu_2) = \Sigma_{1|2}\left(\Lambda_{11}\mu_1 - \Lambda_{12}(x_2-\mu_2)\right)$
    $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Lambda_{11}^{-1}$

    See KPM chapter 4.3.4.3 for the derivation. Equation 4.69 is also used in the proof of Theorem 4.4.1 (Bayes rule for linear Gaussian systems).

    $p(x_1,x_2) = p(x_1|x_2)\,p(x_2) \tag{4.118}$
    $= N(x_1|\mu_{1|2}, \Sigma_{1|2})\,N(x_2|\mu_2, \Sigma_{22}) \tag{4.119}$
    $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \tag{4.120}$
    $\Sigma_{1|2} = \Sigma/\Sigma_{22} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \tag{4.121}$

    That is, from the mean and covariance of the joint distribution $p(x_1,x_2)$ we can obtain the $\mu_{1|2}$ and $\Sigma_{1|2}$ of each conditional (or marginal) distribution.

    Gaussian process (random process)

    Breaking the term "Gaussian process" down literally, it has two parts:

    - Gaussian, referring to the Gaussian distribution

    - Process, referring to a random (stochastic) process

    v2-13bb48ae827530521164f00f9a89311e_720w

    When the random variable is 1-dimensional, we call it a univariate Gaussian distribution with density $p(x) = N(\mu, \sigma^2)$. When the dimension rises to a finite $p$, it becomes a multivariate Gaussian, $p(x) = N(\mu, \Sigma_{p\times p})$. A Gaussian process goes one step further: it is a random process composed of infinitely many Gaussian random variables defined on a continuous domain. In other words, a Gaussian process is an infinite-dimensional Gaussian distribution.

    For a continuous domain $T$ (say, a time axis), if for any $n$ chosen times $t_1, t_2, t_3, \ldots, t_n \in T$ the resulting $n$-dimensional vector $(\xi_1, \xi_2, \xi_3, \ldots, \xi_n)$ follows an $n$-dimensional Gaussian distribution, then $\xi_t$ is a Gaussian process.

    A $p$-dimensional Gaussian distribution is determined by two parameters: the $p$-dimensional mean vector $\mu_p$, which gives the expectation of each dimension, and the $p\times p$ covariance matrix $\Sigma_{p\times p}$, which gives each dimension's own variance as well as the covariances between different dimensions.

    A Gaussian process defined on a continuous domain $T$ is the same in spirit: it is an infinite-dimensional Gaussian, and it likewise needs to describe the mean at every time point $t$. A vector no longer works, because the domain is continuous and the number of dimensions is infinite, so the mean is defined as a function of time $t$: $m(t)$.

    The covariance matrix is treated the same way: in the infinite-dimensional case it is defined as a kernel function $k(s,t)$, where $s$ and $t$ are any two time points; the kernel is also called the covariance function. The kernel is the core of a Gaussian process and determines its properties. In research and practice there are many kernel types, each measuring the process differently. Here we introduce and use the most common one, the radial basis function RBF - Squared Exponential Kernel, defined as follows:

    $k(x_a, x_b) = \sigma^2\exp\left(-\frac{\|x_a - x_b\|^2}{2\ell^2}\right)$

    Here $\sigma$ and $\ell$ are hyperparameters of the radial basis function that we can set in advance, e.g. $\sigma = 1$, $\ell = 1$. From this expression we can read off the idea behind it:

    $s$ and $t$ are two different time points on the GP's continuous domain, and $\|s-t\|^2$ is a squared norm - simply put, the distance between $s$ and $t$. The RBF outputs a scalar representing the covariance between the Gaussian distributions at $s$ and $t$. Clearly the RBF is negatively related to the distance between $s$ and $t$: the larger the distance between two points, the smaller the covariance between their distributions (the weaker the correlation); conversely, the closer two time points are, the larger the covariance of the corresponding distributions.

    Thus the two core elements of a Gaussian process - the mean function and the kernel - have been described. By the existence theorem for Gaussian processes, once these two elements are fixed, the whole Gaussian process is determined:

    $\xi_t \sim GP(m(t), k(t,s))$

     

    The random process is $f(x)$; $x$ is not a random variable (for example, it could be time $t$), but $f(x)$ is a random process: given a certain $x_i$, $f(x_i)$ is one random variable.

    $f(x) \sim GP(m(x), k(x,x'))$

    means the collection $f(x) = (f(x_1), f(x_2), \ldots, f(x_n))$ has a joint Gaussian distribution with

    $\mu = \begin{pmatrix} \mu_1 = m(x_1) \\ \vdots \\ \mu_n = m(x_n) \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \sigma_{11} = k(x_1,x_1) & \cdots & \sigma_{1n} = k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ \sigma_{n1} = k(x_n,x_1) & \cdots & \sigma_{nn} = k(x_n,x_n) \end{pmatrix}$

    and $k(x_a,x_b) = \sigma^2\exp\left(-\frac{\|x_a-x_b\|^2}{2\ell^2}\right)$; the hyperparameters $\sigma^2$ and $\ell$ need to be learned from data.

    image-20230505145229063

    Cholesky Decomposition theorem: if $A \in R^{n\times n}$ is a symmetric positive-definite matrix, then there exists a lower-triangular matrix $L \in R^{n\times n}$ with all-positive diagonal entries such that $A = LL^T$. $L$ is lower triangular, of the form:

    $L = \begin{pmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix}, \quad L^T = \begin{pmatrix} l_{11} & l_{21} & \cdots & l_{n1} \\ 0 & l_{22} & \cdots & l_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & l_{nn} \end{pmatrix}$

    So sampling: $f = \mu + L\,N(0, I)$. Here $\mu$ is the mean vector (not the mean function); the shapes are $(n,1) = (n,1) + (n,n)(n,1)$, and we sample $N(0,1)$ independently for each $f_i$.

    kernel function https://peterroelants.github.io/posts/gaussian-process-kernels/
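Drawing a GP prior sample via the Cholesky route above, $f = \mu + L\,N(0,I)$ (the grid, kernel hyperparameters, and jitter value are illustrative; a small jitter is added to keep the kernel matrix numerically positive definite):

```python
import numpy as np

def rbf(a, b, sigma=1.0, ell=1.0):
    """RBF kernel: sigma^2 exp(-||a - b||^2 / (2 ell^2)) for 1-d inputs."""
    return sigma ** 2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

x = np.linspace(0, 5, 100)
K = rbf(x, x) + 1e-8 * np.eye(100)   # jitter for numerical positive-definiteness
L = np.linalg.cholesky(K)            # K = L L^T

rng = np.random.default_rng(8)
f = 0.0 + L @ rng.normal(size=100)   # f = mu + L N(0, I), here m(x) = 0
```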

    Noiseless GP regression

    Given new inputs, predict $f^*$ where the observed values are $\bar{f} \sim N(0, K)$. http://www.cnblogs.com/hxsyl/p/5229746.ht

    image-20230515090059891

     

    A valid covariance matrix is a (symmetric) Positive Semi-definite Matrix.

    A matrix $A$ is positive definite if $X^TAX > 0$ for all $X \neq 0$.
    A matrix $A$ is positive semi-definite if $X^TAX \geq 0$ for all $X \neq 0$.

    A necessary and sufficient condition for $A$ to be positive semi-definite is that all of its principal minors are greater than or equal to zero.

    If you know SVM, you will have come across the famous Mercer Theorem (you will also have seen it if you know functional analysis). The theorem says: a matrix is a Positive Semi-definite Matrix if and only if it is a Mercer Kernel.

    So this is a Bayesian method: unlike OLS regression, it gives the entire (posterior) probability distribution of the predicted values. To emphasize again: what we obtain is the whole distribution of $f^*$, not a point estimate.

    image-20230510234021480

    GP posterior, general (Bayesian): $D = \{(x_i, f_i),\ i = 1{:}N\}$, where $f_i = f(x_i)$ is the noise-free observation of the function evaluated at $x_i$

    Given a test set $X_*$ of size $N_* \times D$, we want to predict the function outputs $f_*$. By definition of the GP, the joint distribution has the following form:

    $\begin{pmatrix} f \\ f_* \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ \mu_* \end{pmatrix},\ \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix}\right)$

    where $K = \kappa(X,X)$ is $N\times N$, $K_* = \kappa(X,X_*)$ is $N\times N_*$, and $K_{**} = \kappa(X_*,X_*)$ is $N_*\times N_*$. The posterior has the following form (by utilizing 4.69):

    $p(f_*|X_*, X, f) = N(f_*|\mu_*, \Sigma_*) \tag{15.7}$
    $\mu_* = \mu(X_*) + K_*^T K^{-1}(f - \mu(X)) \tag{15.8}$
    $\Sigma_* = K_{**} - K_*^T K^{-1} K_* \tag{15.9}$

    here we could use a squared exponential kernel, aka Gaussian kernel or RBF kernel. In 1d, this is given by

    $\kappa(x, x') = \sigma_f^2\exp\left(-\frac{1}{2\ell^2}(x - x')^2\right)$

    The actual algorithm for this computation:

    $\bar{f}_* = k_*^T K_y^{-1} y, \quad K_y = LL^T, \quad \alpha = K_y^{-1}y = L^{-T}(L^{-1}y)$

    Algorithm 15.1: GP regression
    1. $L = \text{cholesky}(K + \sigma_y^2 I)$
    2. $\alpha = L^T \backslash (L \backslash y)$
    3. $E[f_*] = k_*^T\alpha$
    4. $v = L \backslash k_*$
    5. $\text{var}[f_*] = \kappa(x_*, x_*) - v^Tv$
    6. $\log p(y|X) = -\frac{1}{2}y^T\alpha - \sum_i \log L_{ii} - \frac{N}{2}\log(2\pi)$
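A direct transcription of Algorithm 15.1 into Python, using triangular solves in place of the backslash notation (the kernel and the three training points below are toy choices for illustration):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_regress(X, y, x_star, kern, sigma_y2):
    """GP regression via Cholesky, following Algorithm 15.1 (KPM)."""
    K = kern(X, X)
    L = cholesky(K + sigma_y2 * np.eye(len(X)), lower=True)            # 1: L = chol(K + sig_y^2 I)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))  # 2: alpha = L^T \ (L \ y)
    k_star = kern(X, x_star)
    mean = k_star.T @ alpha                                            # 3: E[f*] = k*^T alpha
    v = solve_triangular(L, k_star, lower=True)                        # 4: v = L \ k*
    var = kern(x_star, x_star).diagonal() - np.sum(v ** 2, axis=0)     # 5: var[f*] = k** - v^T v
    logml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
             - 0.5 * len(X) * np.log(2 * np.pi))                       # 6: log p(y|X)
    return mean, var, logml

# toy example: unit-length-scale RBF kernel, 3 noise-free-ish observations
kern = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, var, logml = gp_regress(X, y, np.array([0.5, 1.5]), kern, 1e-6)
```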

    Noise GP regression

    image-20230515090941017

    $\kappa_y(x_p, x_q) = \sigma_f^2\exp\left(-\frac{1}{2\ell^2}(x_p - x_q)^2\right) + \sigma_y^2\delta_{pq}$

    We can extend the SE kernel to multiple dimensions as follows:

    $\kappa_y(x_p, x_q) = \sigma_f^2\exp\left(-\frac{1}{2}(x_p - x_q)^TM(x_p - x_q)\right) + \sigma_y^2\delta_{pq}$

    Learning Covariance Parameters (also called hyper-parameters) via maximum likelihood

    Refer to KPM, chapter 15.2.4 Estimating the kernel parameters

    Refer to Maximum Likelihood Estimation of Gaussian Parameters (jrmeyer.github.io)

    Learning the kernel parameters (maximum (marginal) likelihood, Bayesian, cross validation)

    include the Noisy parameters σy

    $\log p(y|X) = \log N(y|0, K_y) = -\frac{1}{2}y^TK_y^{-1}y - \frac{1}{2}\log|K_y| - \frac{N}{2}\log(2\pi)$ (the equation used in code for the gradient)
    $\frac{\partial}{\partial\theta_j}\log p(y|X) = \frac{1}{2}y^TK_y^{-1}\frac{\partial K_y}{\partial\theta_j}K_y^{-1}y - \frac{1}{2}\text{tr}\left(K_y^{-1}\frac{\partial K_y}{\partial\theta_j}\right) = \frac{1}{2}\text{tr}\left((\alpha\alpha^T - K_y^{-1})\frac{\partial K_y}{\partial\theta_j}\right)$

    where $\alpha = K_y^{-1}y$. It takes $O(N^3)$ time to compute $K_y^{-1}$, and then $O(N^2)$ time per hyperparameter to compute the gradient.

    $\kappa_y(x, x') = E\left[(f(x) - m(x))(f(x') - m(x'))^T\right] = \sigma_f^2\exp\left(-\frac{1}{2\ell^2}(x-x')^2\right) + \sigma_y^2\delta = \sigma_f^2\exp\left(-\frac{1}{2}(x-x')^TM(x-x')\right) + \sigma_y^2\delta$

    where $M_1 = \ell^{-2}I$, $M_2 = \text{diag}(\ell)^{-2}$, $M_3 = \Lambda\Lambda^T + \text{diag}(\ell)^{-2}$

    Optimize it using gradient descent to get $\theta = \ell$, $\sigma_y$, or $\sigma_f$.

    The form of Kyθj depends on the form of the kernel, and which parameter we are taking derivatives with respect to. Often we have constraints on the hyper-parameters, such as σy20. In this case, we can define θ=log(σy2), and then use the chain rule. Given an expression for the log marginal likelihood and its derivative, we can estimate the kernel parameters using any standard gradient-based optimizer. However, since the objective is not convex, local minima can be a problem.
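A sketch of learning the kernel parameters by minimizing the negative log marginal likelihood, using the log-parameter (chain rule) trick mentioned above to keep $\ell$, $\sigma_f^2$, $\sigma_y^2$ positive. Here the gradient is left to the optimizer's finite differences rather than the analytic formula, and the data and starting values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
X = np.sort(rng.uniform(0, 5, size=25))
y = np.sin(X) + 0.1 * rng.normal(size=25)

def neg_log_ml(log_params):
    """Negative log marginal likelihood; parameters live in log space
    so that ell, sigma_f^2, sigma_y^2 stay positive."""
    ell, sf2, sy2 = np.exp(log_params)
    K = sf2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell ** 2) \
        + sy2 * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))  # jitter for safety
    except np.linalg.LinAlgError:
        return 1e10   # penalize numerically invalid parameter settings
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(X) * np.log(2 * np.pi))

x0 = np.log([1.0, 1.0, 0.1])
res = minimize(neg_log_ml, x0=x0, method='L-BFGS-B')
ell_hat, sf2_hat, sy2_hat = np.exp(res.x)
```

Since the objective is not convex, different starting points can land in different local minima, as the text warns.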

    image-20230513165946657

    $\frac{\partial K}{\partial \ell} = \sigma^2\exp\left(-\frac{(x-x')^T(x-x')}{2\ell^2}\right)\left(\frac{(x-x')^T(x-x')}{\ell^3}\right)$

    $\frac{\partial K}{\partial \sigma_f} = 2\sigma\exp\left(-\frac{(x-x')^T(x-x')}{2\ell^2}\right)$

    $\frac{\partial K}{\partial \sigma_y}$

    In another webpage (Optimizing Parameters (mlatcl.github.io)) (I also don't see any further steps based on those complex equations there; in the GPy code?):

    $E(\theta) = \frac{1}{2}\log\det K + \frac{y^TK^{-1}y}{2}$

    The parameters are inside the covariance function (matrix)

    $k_{i,j} = k(x_i, x_j; \theta), \quad K = R\Lambda^2R^T$

    $\Lambda$ represents distance on the axes; $R$ gives rotation.

    • $\Lambda$ is diagonal, $R^TR = I$.

    • Useful representation since $\det K = \det\Lambda^2 = (\det\Lambda)^2$.

    image-20230513192941389

     

    Scalability of the Gradient (no any further steps)

    In the previous section, we assumed that we can compute the gradient exactly. However, if the dimension of the vector y, n increases, it might not be possible to compute the above gradient in a reasonable time and cost. Let’s analyze the computational complexity of each term. Learning Gaussian Process Covariances (chrischoy.github.io)

    make the Gradient of the Posterior Probability easier

    $\text{Tr}\left(K^{-1}\frac{\partial K}{\partial\theta_i}\right) = \text{Tr}\left(K^{-1}\frac{\partial K}{\partial\theta_i}E[rr^T]\right) = E\left[r^TK^{-1}\frac{\partial K}{\partial\theta_i}r\right]$
    $\frac{\partial}{\partial\theta_i}\log p(y|X,f;\theta) \approx -\frac{1}{2N_s}\sum_i^{N_s} r_i^TK^{-1}\frac{\partial K}{\partial\theta_i}r_i + \frac{1}{2}(y-f)^TK^{-1}\frac{\partial K}{\partial\theta_i}K^{-1}(y-f)$

    This brings the computational complexity from $O(n^3)$ down to $O(\kappa n^2 N_s)$, where $N_s$ is the number of samples.

    GPR vs RBF

    image-20230505232157114

    Ridge regression: $(X^TX + \delta^2I_d)\theta = X^Ty$; the solution can also be written as $\theta = X^T\alpha$, where $\alpha = \delta^{-2}(y - X\theta)$.

    $\alpha$ can also be written as $\alpha = (XX^T + \delta^2I_n)^{-1}y$, with the proof as follows:

    $\delta^2\alpha = y - X\theta \ \Rightarrow\ \delta^2\alpha = y - XX^T\alpha \ \Rightarrow\ XX^T\alpha + \delta^2I_n\alpha = y \ \Rightarrow\ \alpha = (XX^T + \delta^2I_n)^{-1}y$

    so the computation can be done with $\theta$ in $d$ (feature number) dimensions or with $\alpha$ in $n$ (data number) dimensions

    $y_* = x_*\theta = x_*X^T\alpha = x_*X^T\left[XX^T + \delta^2I_n\right]^{-1}y = k_*K_y^{-1}y$
    $k_* = \left[x_*x_1^T\ \ x_*x_2^T\ \cdots\ x_*x_n^T\right], \quad K_y = XX^T + \delta^2I_n = \begin{bmatrix} x_1x_1^T & \cdots & x_1x_n^T \\ \vdots & \ddots & \vdots \\ x_nx_1^T & \cdots & x_nx_n^T \end{bmatrix} + \delta^2I_n$; $\delta^2$ plays the same role as $\sigma_y^2$

    Each $x_ix_j^T$ is a scalar (the inner product of two $d$-dimensional row vectors).

     

    Likelihood for a Gaussian (lots of algebra)

    Maximum Likelihood Estimation of Gaussian Parameters (jrmeyer.github.io)

    $\mu_{MLE} = \arg\max_\mu N(X|\mu,\Sigma), \quad \Sigma_{MLE} = \arg\max_\Sigma N(X|\mu,\Sigma)$
    $LL = \sum_{n=1}^N\left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_n-\mu)^2}{2\sigma^2}\right) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2$
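Setting the derivatives of the log-likelihood above to zero gives the familiar closed-form MLEs, which can be sanity-checked numerically (the true $\mu = 3$, $\sigma = 2$ and sample size are made-up values):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(3.0, 2.0, size=100_000)    # samples from N(mu=3, sigma=2)

# closed-form MLEs from dLL/dmu = 0 and dLL/dsigma^2 = 0
mu_mle = np.mean(x)
sigma2_mle = np.mean((x - mu_mle) ** 2)   # note: the biased 1/N estimator
```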

    Jensen's inequality

    Theorem (Jensen's Inequality). If $f: R \to R$ is a convex function, and $x$ is a random variable, then

    $E[f(x)] \geq f(E[x])$

    Moreover, if $f$ is strictly convex, then equality implies that $x = E[x]$ with probability 1 (i.e. $x$ is a constant).

    $g(x) = \log x$ is concave, so $\log(E[x]) \geq E[\log x]$.

    Lower Bound of marginal log-likelihood

    $\log p(x|\theta) = \log\left[\sum_z p(x,z|\theta)\right] = \log\left[\sum_z q(z)\left(\frac{p(x,z|\theta)}{q(z)}\right)\right]$ (log of an expectation) $\geq \sum_z q(z)\log\left(\frac{p(x,z|\theta)}{q(z)}\right) \triangleq L(q,\theta)$ (expectation of log)

    Variational lower bound and ELBO

    For any PMF q(z), we have a lower bound on the marginal log-likelihood

    $\log p(x|\theta) \geq \sum_z q(z)\log\left(\frac{p(x,z|\theta)}{q(z)}\right) \triangleq L(q,\theta)$

    The marginal log likelihood $\log p(x|\theta)$ is also called the evidence, so $L(q,\theta)$ is the evidence lower bound, or ELBO.

    To select the inducing inputs $X_m$, apply variational inference in an augmented probability space that involves both the training latent function values $f$ and the pseudo-input inducing variables $f_m$.

    The initial joint model p(y,f) is augmented with the variable fm to form the model

    p(y,f, fm)=p(y|f)p(f|fm)p(fm)

    where the conditional prior is given by

    p(f|fm)=N(f|KnmKmm1fm, KnnKnmKmm1Kmn)

    Posterior distribution

    $p(f|y) = \int p(f, f_m|y)\,df_m$, where $p(f, f_m|y) = p(f|f_m, y)\,p(f_m|y)$

    Log Marginal likelihood

    logp(y)=logp(y|f)p(f|fm)p(fm)dfdfm

    Approximate the true posterior distribution by introducing a variational distribution q(f,fm) and minimize the KL divergence:

    KL(q(f,fm)||p(f, fm|y))=q(f,fm)logq(f,fm)p(f,fm|y)dfdfm

    To determine the variational quantities $(X_m, \phi)$, we minimize the KL divergence, which is equivalent to maximizing the following variational lower bound on the true log marginal likelihood, $\log p(y) \geq F_v$:

    $F_v(X_m,\phi) = \int q(z)\log\left(\frac{p(z,x)}{q(z)}\right)dz = \iint q(f,f_m)\log\left(\frac{p(f,f_m,y)}{q(f,f_m)}\right)df\,df_m$, with $q(f,f_m) = p(f|f_m)\,\phi(f_m)$:
    $= \iint p(f|f_m)\,\phi(f_m)\log\frac{p(y|f)\,p(f|f_m)\,p(f_m)}{p(f|f_m)\,\phi(f_m)}\,df\,df_m$
    $= \iint p(f|f_m)\,\phi(f_m)\log\frac{p(y|f)\,p(f_m)}{\phi(f_m)}\,df\,df_m$
    $= \int \phi(f_m)\left\{\int p(f|f_m)\log p(y|f)\,df + \log\frac{p(f_m)}{\phi(f_m)}\right\}df_m$

    Condition givens

    $p(y|f) = N(y|f, \sigma_{noise}^2 I)$

    p(f|fm)=N(f|KnmKmm1fm, KnnKnmKmm1Kmn)

    $\log G(f_m, y) = \int p(f|f_m)\log p(y|f)\,df = \int p(f|f_m)\left\{-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\text{Tr}\left[yy^T - 2yf^T + ff^T\right]\right\}df = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\text{Tr}\left[yy^T - 2y\alpha^T + \alpha\alpha^T + K_{nn} - Q_{nn}\right]$

    $\log G(f_m,y) = \log\left[N(y|\alpha, \sigma^2I)\right] - \frac{1}{2\sigma^2}\text{Tr}(K_{nn} - Q_{nn})$, where $\alpha = E[f|f_m] = K_{nm}K_{mm}^{-1}f_m$ and $Q_{nn} = K_{nm}K_{mm}^{-1}K_{mn}$

    $F_v(X_m,\phi) = \int \phi(f_m)\left\{\int p(f|f_m)\log p(y|f)\,df + \log\frac{p(f_m)}{\phi(f_m)}\right\}df_m = \int \phi(f_m)\log\frac{N(y|\alpha, \sigma^2I)\,p(f_m)}{\phi(f_m)}\,df_m - \frac{1}{2\sigma^2}\text{Tr}(K_{nn} - Q_{nn})$

    Maximum the bound w.r.t. the distribution ϕ

    1. The usual way of doing this is to take the derivative w.r.t. ϕ(fm), set to zero and obtain the optimal

    $$\phi^*(f_m) = \frac{N(y\mid\alpha,\ \sigma^2 I)\,p(f_m)}{\int N(y\mid\alpha,\ \sigma^2 I)\,p(f_m)\,df_m}$$

    2. A faster and by far simpler way to compute the optimal bound is to reverse Jensen's inequality, moving the log outside of the integral:

    $$F_V(X_m) = \log\int N(y\mid\alpha,\ \sigma^2 I)\,p(f_m)\,df_m - \frac{1}{2\sigma^2}\mathrm{Tr}(K_{nn}-Q_{nn}) = \log N(y\mid 0,\ \sigma^2 I + Q_{nn}) - \frac{1}{2\sigma^2}\mathrm{Tr}(K_{nn}-Q_{nn})$$

    The optimal distribution ϕ that gives rise to this bound is given by

    ϕ(fm)  N(y|α, σ2I)p(fm)

    $$\phi^*(f_m) = c\,\exp\left\{-\frac{1}{2}f_m^T\Big(K_{mm}^{-1} + \frac{1}{\sigma^2}K_{mm}^{-1}K_{mn}K_{nm}K_{mm}^{-1}\Big)f_m + \frac{1}{\sigma^2}y^T K_{nm}K_{mm}^{-1}f_m\right\}$$
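The collapsed bound $F_V(X_m)$ is straightforward to evaluate numerically. A minimal numpy sketch, assuming a unit-variance RBF kernel, inducing inputs taken as a subset of the training inputs, and a small jitter on $K_{mm}$ (all illustrative choices), which also checks $F_V \le \log p(y)$ against the exact GP marginal likelihood:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # squared-exponential kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def titsias_bound(X, y, Xm, sigma2, ell=1.0):
    """F_v = log N(y | 0, Q_nn + sigma2 I) - Tr(K_nn - Q_nn)/(2 sigma2)."""
    n = len(y)
    Kmm = rbf(Xm, Xm, ell) + 1e-8 * np.eye(len(Xm))   # jitter for stability
    Knm = rbf(X, Xm, ell)
    Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)           # K_nm K_mm^{-1} K_mn
    C = Qnn + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    logN = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))
    # k(x, x) = 1 for this kernel, so diag(K_nn) is all ones
    trace_term = (np.ones(n) - np.diag(Qnn)).sum() / (2 * sigma2)
    return logN - trace_term

def exact_log_marginal(X, y, sigma2, ell=1.0):
    n = len(y)
    C = rbf(X, X, ell) + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
Fv = titsias_bound(X, y, X[:10], sigma2=0.01)
logZ = exact_log_marginal(X, y, sigma2=0.01)
# the bound never exceeds the true log marginal likelihood
```

In practice this bound is maximized jointly over $X_m$ and the kernel hyperparameters.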

    KL divergence and variational inference

    Definition 8.11 (KL divergence). For two distributions q(x) and p(x)

    $$KL(q\|p) = D_{KL}(q\|p) = \langle \log q(x) - \log p(x)\rangle_{q(x)} \ge 0, \qquad D_{KL}(q\|p) = \int q(x)\log\frac{q(x)}{p(x)}\,dx$$

    using approximate q(z) to estimate p(z|x)

    In words: given the data, the distribution of the latent variable is unknown. For a GMM, for example, we don't know how many Gaussians there are or what their parameters are. We guess a distribution $q(z)$ (randomly assigned), then approximate $p(z|x)$ by maximizing the ELBO; the result of the approximation is the desired $p(x|\theta)$, i.e. $p(x)$.

    $$D_{KL}(q\|p) = \int q(z)\log\frac{q(z)}{p(z|x)}\,dz = -\int q(z)\log\frac{p(z|x)}{q(z)}\,dz$$

    Conditional distribution: $p(z|x) = \dfrac{p(z,x)}{p(x)}$

    $$D_{KL}(q\|p) = -\int q(z)\log\frac{p(z|x)}{q(z)}\,dz = -\int q(z)\log\left(\frac{p(z,x)}{q(z)}\cdot\frac{1}{p(x)}\right)dz$$

    $$= -\int q(z)\log\frac{p(z,x)}{q(z)}\,dz + \int q(z)\log p(x)\,dz$$

    So we get $D_{KL}\big(q(z)\,\|\,p(z|x)\big) + L = \log p(x)$, where $L = \int q(z)\log\frac{p(z,x)}{q(z)}\,dz$ is the ELBO. Since $\log p(x)$ is fixed with respect to $q$, we can use $L$ to manipulate $KL(q\|p)$: maximizing $L$ minimizes $D_{KL}(q\|p)$.
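The identity $\log p(x) = L + D_{KL}$ can be checked directly on a toy discrete model (the joint table below is made up purely for illustration):

```python
import numpy as np

# A toy joint p(z, x) over a discrete latent z (3 states) at one observed x.
p_joint = np.array([0.2, 0.3, 0.1])      # p(z, x) for z = 0, 1, 2
p_x = p_joint.sum()                      # evidence p(x) = 0.6
p_post = p_joint / p_x                   # true posterior p(z | x)

q = np.array([0.5, 0.25, 0.25])          # any guessed variational distribution q(z)

elbo = np.sum(q * np.log(p_joint / q))   # L = sum_z q(z) log p(z,x)/q(z)
kl = np.sum(q * np.log(q / p_post))      # KL(q || p(z|x))

# identity: log p(x) = ELBO + KL, and KL >= 0, so ELBO <= log p(x)
```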

     

    image-20230505233755779

    MLE

    Maximum likelihood estimation properties

    The goal is to maximize the likelihood of the given training data $(x_{1:n}, y_{1:n})$: choose the values of the parameters $(\theta, \sigma)$ that give the highest probability of generating the data.

    $$\hat\theta = \arg\max_\theta\, p(y_{1:n}\mid x_{1:n},\theta,\sigma)$$

    $$L(\theta,\sigma) = -\frac{n}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}(y - X\theta)^T(y - X\theta)$$

    Partial derivative w.r.t. $\sigma$:

    $$\frac{\partial L(\theta,\sigma)}{\partial\sigma} = 0 \;\Rightarrow\; \sigma^2 = \frac{1}{n}(y - X\theta)^T(y - X\theta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\theta)^2$$

    Partial derivative w.r.t. $\theta$:

    $$\frac{\partial L(\theta)}{\partial\theta} = 0 \;\Rightarrow\; \frac{1}{2\sigma^2}\big[-2X^T y + 2X^T X\theta\big] = 0$$

    $$\hat\theta = (X^T X)^{-1}X^T y \;\Rightarrow\; \hat\theta = (X^T X + \delta^2 I_d)^{-1}X^T y \text{ (with regularization)}$$
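A quick numpy check of the two closed-form estimators on synthetic data (the noise level and $\delta^2$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.standard_normal((n, d))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# maximum-likelihood / least-squares: theta = (X^T X)^{-1} X^T y
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# ridge: theta = (X^T X + delta^2 I)^{-1} X^T y
delta2 = 10.0
theta_ridge = np.linalg.solve(X.T @ X + delta2 * np.eye(d), X.T @ y)

# regularization shrinks the estimate toward zero
```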

    Entropy $H$ is a measure of the uncertainty associated with a random variable. It is defined as:

    $$H(X) = -\sum_x p(x|\theta)\log p(x|\theta)$$

    Bernoulli distribution example; the entropy is

    $$H(x) = -\sum_{x=0}^{1}\theta^x(1-\theta)^{1-x}\log\big[\theta^x(1-\theta)^{1-x}\big] = -\big[(1-\theta)\log(1-\theta) + \theta\log\theta\big]$$

    Maximizing the likelihood minimizes the KL divergence

    $$\hat\theta = \arg\max_\theta \prod_{i=1}^N p(x_i|\theta) = \arg\max_\theta \sum_{i=1}^N \log p(x_i|\theta) = \arg\max_\theta \frac{1}{N}\sum_{i=1}^N \log p(x_i|\theta) - \frac{1}{N}\sum_{i=1}^N \log p(x_i|\theta_0)$$

    $$= \arg\max_\theta \frac{1}{N}\sum_{i=1}^N \log\frac{p(x_i|\theta)}{p(x_i|\theta_0)} \;\to\; \arg\max_\theta \int p(x|\theta_0)\log\frac{p(x|\theta)}{p(x|\theta_0)}\,dx, \qquad x_i \sim p(x|\theta_0)\ \text{(the true distribution)}$$

    $$= \arg\min_\theta \int p(x|\theta_0)\log\frac{p(x|\theta_0)}{p(x|\theta)}\,dx = \arg\min_\theta KL\big(p(x|\theta_0)\,\|\,p(x|\theta)\big) = \arg\min_\theta \underbrace{\int p(x|\theta_0)\log p(x|\theta_0)\,dx}_{\text{negative entropy } -H(x)} - \underbrace{\int p(x|\theta_0)\log p(x|\theta)\,dx}_{\text{information in model: cross entropy}}$$

    (Subtracting the $\theta_0$ term, which is constant in $\theta$, does not change the argmax.)

    Apply the expectation identity (law of large numbers, as $N \to \infty$):

    $$E[g(x)] = \int g(x)\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N g(x_i)$$
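A numerical illustration for a Bernoulli model (true parameter $\theta_0 = 0.3$, chosen arbitrarily): the $\theta$ maximizing the average log-likelihood and the $\theta$ minimizing $KL(p(\cdot|\theta_0)\,\|\,p(\cdot|\theta))$ coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.3                                   # true Bernoulli parameter
x = (rng.random(100000) < theta0).astype(float)

thetas = np.linspace(0.01, 0.99, 99)
# average log-likelihood (1/N) sum_i log p(x_i | theta)
avg_ll = np.array([np.mean(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])
# KL(p(.|theta0) || p(.|theta)) for a Bernoulli, in closed form
kl = theta0 * np.log(theta0 / thetas) + (1 - theta0) * np.log((1 - theta0) / (1 - thetas))

best_mle = thetas[np.argmax(avg_ll)]
best_kl = thetas[np.argmin(kl)]
```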

    image-20230506121035292

    image-20230506121001476

    image-20230506121706176

     

    Approximate Gaussian Process Regression

    Probabilistic regression is usually formulated as follows: given a training set $D = \{(x_i, y_i),\ i = 1,\ldots,n\}$ of $n$ pairs of (vectorial) inputs $x_i$ and noisy (real, scalar) outputs $y_i$, compute the predictive distribution of the function values $f_*$ (or noisy $y_*$) at test locations $x_*$. In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, so the relationship between the (latent) function $f(x)$ and the observed noisy targets $y$ is given by

    $$y_i = f(x_i) + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0,\ \sigma^2_{noise})$$

    The joint GP prior and the independent likelihood are both Gaussian.

    Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior over functions, i.e. assumes a priori that function values behave according to

    $$p(f\mid x_1, x_2, \ldots, x_n) = N(0, K)$$

    where $f = [f_1, f_2, \ldots, f_n]^T$ is a vector of latent function values, $f_i = f(x_i)$, and $K$ is a covariance matrix whose entries are given by the covariance function, $K_{ij} = k(x_i, x_j)$.

    $$p(f, f_*\mid y) = \frac{p(f, f_*)\,p(y|f)}{p(y)}$$

    $$p(f_*\mid y) = \int p(f, f_*\mid y)\,df = \frac{1}{p(y)}\int p(f, f_*)\,p(y|f)\,df$$

    $$p(f, f_*) = N\left(0,\ \begin{bmatrix} K_{f,f} & K_{f,*} \\ K_{*,f} & K_{*,*} \end{bmatrix}\right) \quad\text{and}\quad p(y|f) = N(f,\ \sigma^2_{noise} I)$$

    $$p(f_*\mid y) = N\big(K_{*,f}(K_{f,f} + \sigma^2_{noise}I)^{-1}y,\;\; K_{*,*} - K_{*,f}(K_{f,f} + \sigma^2_{noise}I)^{-1}K_{f,*}\big)$$
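The exact predictive equations translate directly into code. A minimal numpy sketch with a unit-variance RBF kernel (kernel, noise level and data are illustrative assumptions):

```python
import numpy as np

def k_rbf(a, b, ell=1.0):
    # squared-exponential covariance between 1-D input vectors a and b
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, 25))            # training inputs
y = np.sin(x) + 0.05 * rng.standard_normal(25) # noisy targets
xs = np.linspace(-3, 3, 50)                    # test inputs
sigma2 = 0.05**2

Kff = k_rbf(x, x)
Ksf = k_rbf(xs, x)
Kss = k_rbf(xs, xs)

A = Kff + sigma2 * np.eye(len(x))              # K_{f,f} + sigma^2 I
mean = Ksf @ np.linalg.solve(A, y)             # K_{*,f} A^{-1} y
cov = Kss - Ksf @ np.linalg.solve(A, Ksf.T)    # K_{*,*} - K_{*,f} A^{-1} K_{f,*}
```

Note the $O(n^3)$ solve against the full $n \times n$ matrix, which is what the sparse approximations below avoid.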

    exact expression

    Due to the consistency of Gaussian processes, we know that we can recover $p(f_*, f)$ by simply integrating (marginalizing) out $u$ from the joint GP prior $p(f_*, f, u)$:

    $$p(f_*, f) = \int p(f_*, f, u)\,du = \int p(f_*, f\mid u)\,p(u)\,du, \quad\text{where } p(u) = N(0,\ K_{u,u})$$

    Fundamental approximation:

    $$p(f_*, f) \approx q(f_*, f) = \int q(f_*\mid u)\,q(f\mid u)\,p(u)\,du$$

    The name inducing variable is motivated by the fact that $f$ and $f_*$ can only communicate through $u$; $u$ therefore induces the dependencies between training and test cases. As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals $q(f|u)$ and $q(f_*|u)$ of the integral in (8).

    training conditional: $p(f\mid u) = N\big(K_{f,u}K_{u,u}^{-1}u,\; K_{f,f} - Q_{f,f}\big)$

    test conditional: $p(f_*\mid u) = N\big(K_{*,u}K_{u,u}^{-1}u,\; K_{*,*} - Q_{*,*}\big)$

    These can be seen as special (noise-free) cases of the standard predictive equation (6), with $u$ playing the role of (noise-free) observations. Shorthand notation:

    $$Q_{a,b} \equiv K_{a,u}K_{u,u}^{-1}K_{u,b}$$

    The Subset of Regressors (SoR) Approximation

    For any input $x_*$, the corresponding function value $f_*$ is given by:

    $$f_* = K_{*,u}\,w_u, \quad \text{with } p(w_u) = N(0,\ K_{u,u}^{-1})$$

    $$f_* = K_{*,u}K_{u,u}^{-1}u, \quad \text{with } u \sim N(0,\ K_{u,u})$$

    The Deterministic Training Conditional (DTC) Approximation

    Projected Process Approximation (PPA): the method relies on a likelihood approximation based on the projection $f = K_{f,u}K_{u,u}^{-1}u$:

    $$p(y|f) \approx q(y|u) = N\big(K_{f,u}K_{u,u}^{-1}u,\ \sigma^2_{noise} I\big)$$

    Reformulated: $q_{DTC}(f\mid u) = N\big(K_{f,u}K_{u,u}^{-1}u,\ 0\big)$, and $q_{DTC}(f_*\mid u) = p(f_*\mid u)$.

    image-20230519130152624

    $$q(y\mid f_m) \propto \exp\left(-\frac{1}{2\sigma_n^2}(y - Pf_m)^T(y - Pf_m)\right), \qquad P = K_{nm}K_{mm}^{-1} \ \text{so that}\ E[f\mid f_m] = Pf_m, \qquad p(f_m) \propto \exp\big(-f_m^T K_{mm}^{-1}f_m/2\big)$$

    $$q(f_m\mid y) \propto \exp\left(-\frac{1}{2}f_m^T\Big(K_{mm}^{-1} + \frac{1}{\sigma_n^2}P^T P\Big)f_m + \frac{1}{\sigma_n^2}y^T P f_m\right), \qquad E_q[f(x_*)] = k_m(x_*)^T K_{mm}^{-1}\mu = k_m(x_*)^T\big(\sigma_n^2 K_{mm} + K_{mn}K_{nm}\big)^{-1}K_{mn}y$$

    $$V_q[f(x_*)] = k(x_*,x_*) - k_m(x_*)^T K_{mm}^{-1}k_m(x_*) + k_m(x_*)^T K_{mm}^{-1}\mathrm{cov}(f_m\mid y)K_{mm}^{-1}k_m(x_*) = k(x_*,x_*) - k_m(x_*)^T K_{mm}^{-1}k_m(x_*) + \sigma_n^2\,k_m(x_*)^T\big(\sigma_n^2 K_{mm} + K_{mn}K_{nm}\big)^{-1}k_m(x_*)$$

    $$q_{DTC}(f_*\mid y) = N\big(Q_{*,f}(Q_{f,f} + \sigma^2_{noise}I)^{-1}y,\;\; K_{*,*} - Q_{*,f}(Q_{f,f} + \sigma^2_{noise}I)^{-1}Q_{f,*}\big) = N\big(\sigma^{-2}K_{*,u}\Sigma K_{u,f}\,y,\;\; K_{*,*} - Q_{*,*} + K_{*,u}\Sigma K_{u,*}\big), \qquad \Sigma = \big(\sigma^{-2}K_{u,f}K_{f,u} + K_{u,u}\big)^{-1}$$
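A numpy sketch of the DTC/SoR predictive equations in the $\Sigma = (\sigma^{-2}K_{u,f}K_{f,u} + K_{u,u})^{-1}$ form. The RBF kernel, evenly spaced inducing inputs, and data below are illustrative assumptions:

```python
import numpy as np

def k_rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)
xu = np.linspace(-3, 3, 15)                      # inducing inputs
xs = np.linspace(-3, 3, 50)                      # test inputs
sigma2 = 0.1**2

Kuu = k_rbf(xu, xu) + 1e-8 * np.eye(15)          # jitter for stability
Kuf = k_rbf(xu, x)
Ksu = k_rbf(xs, xu)

# Sigma = (K_uu + sigma^-2 K_uf K_fu)^-1 : only an m x m inverse is needed
Sigma = np.linalg.inv(Kuu + Kuf @ Kuf.T / sigma2)
mean = Ksu @ Sigma @ Kuf @ y / sigma2            # sigma^-2 K_*u Sigma K_uf y
# DTC variance: k_** - Q_** + K_*u Sigma K_u*
Qss_diag = np.einsum('ij,ji->i', Ksu, np.linalg.solve(Kuu, Ksu.T))
var = 1.0 - Qss_diag + np.einsum('ij,ji->i', Ksu, Sigma @ Ksu.T)
```

The cost is $O(nm^2)$ instead of $O(n^3)$.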

    The Fully Independent Training Conditional (FITC) Approximation

    While DTC is based on the likelihood approximation given by (17), the SPGP proposes a more sophisticated likelihood approximation with a richer covariance:

    $$p(y|f) \approx p(y|u) = N\big(K_{f,u}K_{u,u}^{-1}u,\ \mathrm{diag}[K_{f,f} - Q_{f,f}] + \sigma^2_{noise} I\big)$$

    where diag[A] is a diagonal matrix whose elements match the diagonal of A.
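Relative to DTC, the only change is the diagonal correction $\mathrm{diag}[K_{f,f} - Q_{f,f}]$ entering the effective (now heteroscedastic) noise. A minimal sketch of the FITC predictive mean under the same illustrative kernel and inducing-point assumptions as above:

```python
import numpy as np

def k_rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)
xu = np.linspace(-3, 3, 15)
xs = np.linspace(-3, 3, 40)
sigma2 = 0.01

Kuu = k_rbf(xu, xu) + 1e-8 * np.eye(15)
Kuf = k_rbf(xu, x)
Ksu = k_rbf(xs, xu)

# per-point effective noise: diag[K_ff - Q_ff] + sigma^2 (a length-n vector)
Qff_diag = np.einsum('ij,ji->i', Kuf.T, np.linalg.solve(Kuu, Kuf))
Lam = (1.0 - Qff_diag) + sigma2
# FITC mean: K_*u (K_uu + K_uf Lam^-1 K_fu)^-1 K_uf Lam^-1 y
Sigma = np.linalg.inv(Kuu + (Kuf / Lam) @ Kuf.T)
mean_fitc = Ksu @ Sigma @ (Kuf @ (y / Lam))
```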

    image-20230507234501358

    Posterior Approximations

     

     in 2009, Michalis Titsias published a paper that proposed a different approach: “Variational Learning of Inducing Variables in Sparse Gaussian Processes” (Titsias 2009). This method does not quite fit into the unifying view proposed by Quiñonero-Candela. The key idea is to construct a variational approximation to the posterior process, and learn the pseudo-points Z alongside the kernel parameters by maximising the evidence lower bound (ELBO), i.e. a lower bound on the log-marginal likelihood. There was quite a bit of prior art on variational inference in Gaussian processes (e.g. Csató and Opper 2002; Seeger 2003): Titsias’ important contribution was to treat Z as parameters of the variational approximation, rather than model parameters.

    Take an extra $M$ points on the function, $u = f(Z)$; the joint model becomes $p(y, f, u) = p(y|f)\,p(f|u)\,p(u)$.

    Instead of doing

    $$p(f\mid y, X) = \frac{p(y|f)\,p(f|X)}{\int p(y|f)\,p(f|X)\,df}$$

    we will do

    $$p(u\mid y, Z) = \frac{p(y|u)\,p(u|Z)}{\int p(y|u)\,p(u|Z)\,du}, \qquad p(y|u) = \frac{p(y|f)\,p(f|u)}{p(f|y,u)}, \qquad \ln p(y|u) = \ln p(y|f) + \ln\frac{p(f|u)}{p(f|y,u)}$$
    $$\ln p(y\mid u) = E_{p(f|u)}\big[\ln p(y|f)\big] + E_{p(f|u)}\left[\ln\frac{p(f|u)}{p(f|y,u)}\right] = \ln\tilde p(y\mid u) + KL\big[p(f|u)\,\|\,p(f|y,u)\big]$$

    $$p(y|f) = N(y\mid f,\ \sigma^2 I), \qquad p(f|u) = N\big(f\mid K_{nm}K_{mm}^{-1}u,\ \tilde K\big), \qquad p(u) = N(u\mid 0,\ K_{mm})$$

    $$\tilde p(y\mid u) = \prod_{i=1}^N N\big(y_i\mid k_i^T K_{mm}^{-1}u,\ \sigma^2\big)\exp\left\{-\frac{1}{2\sigma^2}\big(k(x_i,x_i) - k_i^T K_{mm}^{-1}k_i\big)\right\}$$

    where $k_i$ denotes the $i$-th column of $K_{mn}$.

     

    ONLINE SPARSE MATRIX GAUSSIAN PROCESSES

    KERNEL FUNCTION:

    $$k(x_i, x_j) = c\,\exp\left(-\frac{(x_i - x_j)^2}{\eta^2}\right)\max\left\{0,\ \Big(1 - \frac{|x_i - x_j|}{d}\Big)^{\nu}\right\}$$

    At each step during online learning of the GP, the proposed algorithm is presented with a training pair that is used to update the existing model. Assuming the model at time $t$ is given by $p_t(f)$, it is updated upon receiving the training output $y_{t+1}$ using Bayes' law:

    $$p_{t+1}(f) \propto p(y_{t+1}|f)\,p_t(f)$$

    where $p(y_{t+1}|f)$ is the measurement model.

    Streaming Sparse Gaussian Process Approximations

    This paper provides a new principled framework for deploying Gaussian process probabilistic models in the streaming setting.

    Given $N$ input and real-valued output pairs $\{x_n, y_n\}_{n=1}^N$, a standard GP regression model assumes $y_n = f(x_n) + \epsilon_n$, where $f$ is an unknown function that is corrupted by Gaussian observation noise $\epsilon_n \sim N(0, \sigma_y^2)$. Typically, $f$ is assumed to be drawn from a zero-mean GP prior $f \sim GP(0,\ k(\cdot,\cdot|\theta))$ whose covariance function depends on hyperparameters $\theta$. In this simple model, the posterior over $f$, $p(f|y,\theta)$, and the marginal likelihood $p(y|\theta)$ can be computed analytically (here we have collected the observations into a vector $y = \{y_n\}_{n=1}^N$). However, these quantities present a computational challenge, resulting in $O(N^3)$ complexity for maximum likelihood training and $O(N^2)$ per test point for prediction.

    variational free energy approximation scheme which lower bounds the marginal likelihood of the data using a variational distribution q(f) over the latent function:

    $$\log p(y\mid\theta) = \log\int df\,p(y, f\mid\theta) \ge \int df\,q(f)\log\frac{p(y,f\mid\theta)}{q(f)} = F_{vfe}(q,\theta)$$

    Since $F_{vfe}(q,\theta) = \log p(y|\theta) - KL[q(f)\,\|\,p(f|y,\theta)]$, where $KL[\cdot\|\cdot]$ denotes the Kullback–Leibler divergence, maximising this lower bound with respect to $q(f)$ guarantees the approximate posterior gets closer to the exact posterior $p(f|y,\theta)$. Moreover, the variational bound $F_{vfe}(q,\theta)$ approximates the marginal likelihood and can be used for learning the hyperparameters $\theta$.

    image-20230522002224288

    image-20230522002244977

    $$F_{vfe}(q(u),\theta) = \int df\,q(f)\log\frac{p(y|f,\theta)\,p(u|\theta)\,p(f_{\neq u}|u,\theta)}{p(f_{\neq u}|u,\theta)\,q(u)}$$

    $$= -KL\big[q(u)\,\|\,p(u|\theta)\big] + \sum_n \int du\,df_n\,q(u)\,p(f_n|u,\theta)\log p(y_n|f_n,\theta)$$

    $$p(f|y,\theta) \approx q_{vfe}(f) \propto p(f_{\neq u}|u,\theta)\,p(u|\theta)\,N\big(y;\ K_{fu}K_{uu}^{-1}u,\ \sigma_y^2 I\big)$$

    $$\log p(y|\theta) \ge F_{vfe}(\theta) = \log N\big(y;\ 0,\ K_{fu}K_{uu}^{-1}K_{uf} + \sigma_y^2 I\big) - \frac{1}{2\sigma_y^2}\sum_n\big(k_{nn} - K_{nu}K_{uu}^{-1}K_{un}\big)$$

    Minibatch version:

    $$F_{vfe}(q(u),\theta) \approx -KL\big[q(u)\,\|\,p(u|\theta)\big] + \frac{N}{|B|}\sum_{y_n\in B}\int du\,df_n\,q(u)\,p(f_n|u,\theta)\log p(y_n|f_n,\theta)$$

    Consider an approximation to the true posterior at the previous step, $q_{old}(f)$, which must be updated to form the new approximation $q_{new}(f)$:

    $$(2)\quad q_{old}(f) \approx p(f\mid y_{old}) = \frac{1}{Z_1(\theta_{old})}\,p(f|\theta_{old})\,p(y_{old}|f)$$

    $$(3)\quad q_{new}(f) \approx p(f\mid y_{old}, y_{new}) = \frac{1}{Z_2(\theta_{new})}\,p(f|\theta_{new})\,p(y_{old}|f)\,p(y_{new}|f)$$

    the new approximation cannot access p(yold|f) directly. Instead, we can find an approximation of p(yold|f) by inverting eq. (2),

    $$(4)\quad \hat p(f\mid y_{old}, y_{new}) = \frac{Z_1(\theta_{old})}{Z_2(\theta_{new})}\,p(f|\theta_{new})\,p(y_{new}|f)\,\frac{q_{old}(f)}{p(f|\theta_{old})}$$

    The form of the approximate posterior mirrors that in the batch case: the previous approximate posterior is $q_{old}(f) = p(f_{\neq a}|a;\theta_{old})\,q_{old}(a)$, where we assume $q_{old}(a) = N(a;\ m_a,\ S_a)$. The new posterior approximation takes the same form, but with the new pseudo-points and new hyperparameters: $q_{new}(f) = p(f_{\neq b}|b;\theta_{new})\,q_{new}(b)$. Similar to the batch case, this approximate inference problem can be turned into an optimization problem using variational inference. Specifically, consider

    $$(5)\quad KL\big[q_{new}(f)\,\|\,\hat p(f\mid y_{old}, y_{new})\big] = \int df\,q_{new}(f)\log\frac{p(f_{\neq b}|b,\theta_{new})\,q_{new}(b)}{\frac{Z_1(\theta_{old})}{Z_2(\theta_{new})}\,p(f|\theta_{new})\,p(y_{new}|f)\,\frac{q_{old}(f)}{p(f|\theta_{old})}} = \log\frac{Z_2(\theta_{new})}{Z_1(\theta_{old})} + \int df\,q_{new}(f)\left[\log\frac{p(a|\theta_{old})\,q_{new}(b)}{p(b|\theta_{new})\,q_{old}(a)\,p(y_{new}|f)}\right]$$

    $$(6)\quad q_{vfe}(b) \propto p(b)\exp\left(\int da\,p(a|b)\log\frac{q_{old}(a)}{p(a|\theta_{old})} + \int df\,p(f|b)\log p(y_{new}|f)\right) \qquad (7)\quad \propto p(b)\,N\big(\hat y;\ K_{\hat f b}K_{bb}^{-1}b,\ \Sigma_{\hat y, vfe}\big)$$

     

    where $f$ is the latent function values at the new training points, and

    $$\hat y = \begin{bmatrix} y_{new} \\ D_a S_a^{-1} m_a \end{bmatrix}, \qquad K_{\hat f b} = \begin{bmatrix} K_{fb} \\ K_{ab} \end{bmatrix}, \qquad \Sigma_{\hat y, vfe} = \begin{bmatrix} \sigma_y^2 I & 0 \\ 0 & D_a \end{bmatrix}, \qquad D_a = \big(S_a^{-1} - K_{aa}^{-1}\big)^{-1}$$

    The negative variational free energy is also analytically available,

    $$(8)\quad F(\theta) = \log N\big(\hat y;\ 0,\ K_{\hat f b}K_{bb}^{-1}K_{b\hat f} + \Sigma_{\hat y, vfe}\big) - \frac{1}{2\sigma_y^2}\mathrm{tr}\big(K_{ff} - K_{fb}K_{bb}^{-1}K_{bf}\big) + \Delta_a, \quad\text{where}\quad 2\Delta_a = -\log|S_a| + \log|K_{aa}| + \log|D_a| + m_a^T\big(S_a^{-1}D_aS_a^{-1} - S_a^{-1}\big)m_a - \mathrm{tr}\big[D_a^{-1}Q_a\big] + \text{const.}$$

    A Bayesian Approach to Online Learning

    In the Bayesian approach to statistical inference, the degrees of prior belief or plausibility of parameters are expressed within probability distributions, the so called prior distributions (or priors) p(θ). Once this idea is accepted, subsequent inferences can be based on Bayes rule of probability. Formally, we may think that data are generated by a two step process:

    First, the true parameter θ is drawn at random from the prior distribution p(θ).

    Second, the data are drawn at random from P(Dt |θ). Bayes rule yields the conditional probability density (posterior) of the unknown parameter θ, given the data:

    image-20230522003339315

    In order to construct an online algorithm within the Bayesian framework, we have to find out how the posterior distribution changes when a new datapoint $y_{t+1}$ is observed. It can be easily shown that the new posterior corresponding to the new dataset $D_{t+1}$ is given in terms of the old posterior and the likelihood of the new example. (How is unknown yet — check 'A Tutorial on Learning With Bayesian Networks'.)

    This, however, does not have the form of an online algorithm, because it requires knowledge of the entire old dataset $D_t$.

    Minimizing (6.3) can be thought of as minimizing the loss of information in the projection step. For the important case, where the parametric family is an exponential family, i.e. if the densities are of the form

    If we use a general multivariate Gaussian distribution for p(θ|par), then

    par = (mean, covariance) = (θ^i ,Cij ) Matching the moments results in

    Using a simple property of centered Gaussian random variables $z$ (Stein's lemma), namely that for well-behaved functions $f$ we have $E[z\,f(z)] = E[f'(z)]\,E[z^2]$, we can get the explicit update:

    $$\int P(y_{t+1}|\theta)\,p(\theta\mid \mathrm{par}(t))\,d\theta = E_u\big[P(y_{t+1}\mid \hat\theta(t) + u)\big]$$

    where $u$ is a zero-mean Gaussian random vector with covariance $C(t)$.

    Online Learning

    Often, learning algorithms for estimating the unknown parameter θ∗ are based on the principle of Maximum Likelihood (ML). It states that we should choose a parameter θ which maximizes the likelihood P(Dt|θ) of the observed data. Under weak assumptions, ML estimators are asymptotically efficient. As a learning algorithm, one can use e.g. a gradient descent algorithm and iterate

    $$\theta_{k+1} = \theta_k - \eta_k g_k, \qquad g(\theta) = \nabla_\theta f(\theta) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta, x_i)$$

    Here, $E_T(y_k|\theta)$ defines the training energy of the examples to be minimized by the algorithm. When a new example $y_{t+1}$ is received, the ML procedure requires the learner to update her estimate of $\theta$ using all previous data; hence $D_t$ has to be stored in memory. The goal of online learning is to calculate a new estimate $\hat\theta(t+1)$ that is based only on the new data point $y_{t+1}$ and the old estimate $\hat\theta(t)$ (and possibly a set of other auxiliary quantities which have to be updated at each time step, but are much smaller in number than the entire set of previous training data). A popular idea is to use a procedure similar to (4.1), but to replace the training energy of all examples, $\sum_{k=1}^t E_T(y_k|\theta)$, by the training energy of the most recent one. Hence, we get

    Sparse Online Gaussian Processes

    at each step of the algorithm we combine the Kullback-Leibler divergence between distributions

    image-20230505233859783

    where $\theta$ denotes the set of arguments of the densities. If $\hat p$ denotes the approximating Gaussian distribution, one usually tries to minimise $KL(\hat p\,\|\,p_{post})$ with respect to $\hat p$, which, in contrast to $KL(p_{post}\,\|\,\hat p)$, requires only the computation of expectations over tractable distributions.

    Definition 8.11 (KL divergence). For two distributions q(x) and p(x)

    $$KL(q\|p) = D_{KL}(q\|p) = \langle\log q(x) - \log p(x)\rangle_{q(x)} \ge 0, \qquad D_{KL}(q\|p) = \int q(x)\log\frac{q(x)}{p(x)}\,dx$$

    likelihood of a single new data point and the (GP) prior from the result of the previous approximation step. Let $\hat p_t$ denote the Gaussian approximation after processing $t$ examples; we use Bayes' rule

    $$p_{post}(f\mid y_{t+1}) = \frac{p(y_{t+1}|f)\,\hat p_t(f)}{\langle p(y_{t+1}|f)\rangle_t}$$

    to derive the updated posterior. Since $p_{post}$ is no longer Gaussian, we use a variational technique in order to project it to the closest Gaussian process $\hat p_{t+1}$ (see Fig. 1). Unlike the usual variational method, we will now minimize the divergence $KL(p_{post}\,\|\,\hat p_{t+1})$. This is possible because, in our online method, the posterior contains the likelihood of only a single example and the corresponding non-Gaussian integral is one-dimensional, which can for many relevant cases be performed analytically. It is a simple exercise to show (Opper 1998) that the projection results in matching the first two moments (mean and covariance) of $p_{post}$ and the new Gaussian posterior $\hat p_{t+1}$.

    In order to compute the on-line approximations of the mean and covariance kernel Kt we apply Lemma 1 sequentially with only one likelihood term P(yt|xt) at an iteration step. Proceeding recursively, we arrive at

    The averages in (7) are with respect to the Gaussian process at time $t$, and the derivatives are taken with respect to $\langle f_{t+1}\rangle_t = \langle f(x_{t+1})\rangle_t$. Note again that these averages only require a one-dimensional integration over the process at the input $x_{t+1}$. Unfolding the recursion steps in the update rules (6), we arrive at the parametrization of the approximate posterior GP at time $t$ as a function of the initial kernel and the likelihoods ("natural parametrisation"):

    The Sparse GP Algorithm

    For each data element (yt+1, xt+1) we will iterate the following steps:

    1. Compute $q_{t+1}$, $r_{t+1}$, $k_{t+1}$, $k^*_{t+1}$, $\hat e_{t+1}$, and $\gamma_{t+1}$.

    2. If $\gamma_{t+1} < \epsilon_{tol}$, then perform a reduced update with $\hat s_{t+1}$ from eq. (11), without extending the size of the parameters $\alpha$ and $C$. Advance to the next data point.

    3. (else) Perform the update eq. (9) using the unit vector et+1. Add the current input to the BV set and compute the inverse of the extended Gram matrix using eq. (24).

    4. If the size of the BV set is larger than $d$, then compute the scores $\varepsilon_i$ for all BVs from eq. (27), find the basis vector with the minimum score, and delete it using eqs. (25).
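A heavily simplified sketch of this loop structure. This is not the full Csató–Opper algorithm: the reduced update that would still absorb the rejected point's information into $\alpha$ and $C$ is omitted, the score-based deletion of step 4 is replaced by never exceeding a fixed budget, and prediction is done by exact GP regression restricted to the BV set. The novelty score $\gamma$ (distance of the new kernel column from the span of the BV set) is the real quantity from the paper:

```python
import numpy as np

def k_rbf(a, b, ell=1.0):
    a = np.atleast_1d(a); b = np.atleast_1d(b)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

eps_tol, d_max, sigma2 = 0.3, 30, 0.01
bv, yv = [], []                                 # basis-vector inputs and their targets

rng = np.random.default_rng(6)
for t in range(200):                            # stream of (x_t, y_t) pairs
    xt = rng.uniform(-3, 3)
    yt = np.sin(xt) + 0.1 * rng.standard_normal()
    if not bv:
        bv.append(xt); yv.append(yt); continue
    Kbb = k_rbf(np.array(bv), np.array(bv)) + 1e-8 * np.eye(len(bv))
    kt = k_rbf(np.array(bv), xt)[:, 0]
    # novelty: gamma = k(x_t, x_t) - k^T K^{-1} k
    gamma = 1.0 - kt @ np.linalg.solve(Kbb, kt)
    if gamma < eps_tol or len(bv) >= d_max:
        continue                                # reduced update (information discarded here)
    bv.append(xt); yv.append(yt)                # extend the BV set

# predict with an exact GP restricted to the BV set
Kbb = k_rbf(np.array(bv), np.array(bv)) + sigma2 * np.eye(len(bv))
alpha = np.linalg.solve(Kbb, np.array(yv))
pred = k_rbf(np.array([0.5]), np.array(bv))[0] @ alpha   # posterior mean at x* = 0.5
```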

    Variational free energy (VFE) – Variational Learning of Inducing Variables in Sparse Gaussian Processes

    $$\log p(y) \ge \iint q(u,f)\log\frac{p(y|f)\,p(f|u)\,p(u)}{p(f|u)\,q(u)}\,du\,df$$

    Suppose we have a training dataset $\{(x_i, y_i)\}_{i=1}^n$ of $n$ noisy realizations of some unobserved or latent function, so that each scalar $y_i$ is obtained by adding Gaussian noise to $f(x)$ at input $x_i$, i.e. $y_i = f_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ and $f_i = f(x_i)$. Let $X$ denote all training inputs, $y$ all outputs and $f$ the corresponding training latent function values. The joint probability model is $p(y, f) = p(y|f)\,p(f)$.

    Conditional Probabilities

    image-20230522003541092

    where p(y|f ) is the likelihood and p(f ) the GP prior. The data induce a posterior GP which is specified by a posterior mean function and a posterior covariance function:

    We want to define a sparse method that directly approximates the posterior GP mean and covariance functions in eq. (1). This posterior GP can also be described by the predictive Gaussian $p(z|y) = \int p(z|f)\,p(f|y)\,df$, where $p(z|f)$ denotes the conditional prior over any finite set of function points $z$. Suppose that we wish to approximate the above Bayesian integral by using a small set of $m$ auxiliary inducing variables $f_m$ evaluated at the pseudo-inputs $X_m$, which are independent from the training inputs; $f_m$ are just function points drawn from the same GP prior as the training function values $f$. We use the augmented joint model $p(y|f)\,p(z, f_m, f)$,

    where p(z, fm, f ) is the GP prior jointly expressed over the function values z, f and fm

    we equivalently write p(z|y) as

    $$p(z|y) = \iint p(z\mid f_m, f)\,p(f\mid f_m, y)\,p(f_m\mid y)\,df\,df_m \qquad (4)$$

    Suppose now that fm is a sufficient statistic for the parameter f in the sense that z and f are independent given fm, i.e. it holds p(z|fm, f ) = p(z|fm). The above can be written as

    where q(z) = p(z|y) and ϕ(fm) = p(fm|y).

    p(f |fm) = p(f |fm, y) is true since y is a noisy version of f and because of the assumption we made that any z is conditionally independent from f given fm

    $$p(z\mid f_m, y) = \frac{\int p(y|f)\,p(z, f_m, f)\,df}{\iint p(y|f)\,p(z, f_m, f)\,df\,dz}$$

    and by using $p(z\mid f_m, f) = p(z\mid f_m)$.

    Thus, we expect $q(z)$ to be only an approximation to $p(z|y)$. In such a case, we can choose $\phi(f_m)$ to be a "free" variational Gaussian distribution, where in general $\phi(f_m) \neq p(f_m|y)$, that depends on a mean vector $\mu$ and a covariance matrix $A$.

    the approximate posterior GP mean and covariance functions as follows

    B = Kmm1AKmm1

    The question that now arises is how we select the $\phi$ distribution, i.e. $(\mu, A)$, and the inducing inputs $X_m$.

    We use $q(z)$ to approximate $p(z|y)$, where $p(z|y)$ is the predictive Gaussian.

    Next we describe a variational method that allows to jointly specify these quantities and treat Xm as a variational parameter which is rigorously selected by minimizing the KL divergence.

     

    Logistic regression - Optimization

    Sigmoid/logistic function

    $$P(y_i = 1\mid x_i, \theta) = \mathrm{sigm}(\eta) = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-x_i^T\theta}}$$

    image-20230519163637630

    image-20230519163647877

    Logistic regression

    $\pi_i$ is the success probability: $\pi_i^{y_i}(1-\pi_i)^{1-y_i} = \begin{cases}\pi_i & y_i = 1\\ 1-\pi_i & y_i = 0\end{cases}$

    likelihood function 

    $$p(y|X,\theta) = \prod_{i=1}^n \mathrm{Ber}\big(y_i\mid \mathrm{sigm}(x_i^T\theta)\big) = \prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i} = \prod_{i=1}^n \left[\frac{1}{1+e^{-x_i^T\theta}}\right]^{y_i}\left[1 - \frac{1}{1+e^{-x_i^T\theta}}\right]^{1-y_i}$$

    where $x_i^T\theta = \theta_0 + \sum_{j=1}^d \theta_j x_{ij}$

    The cost function (NLL the likelihood function) – cross entropy error function

    $$NLL(\theta) = -\log P(y|X,\theta) = J(\theta) = -\sum_{i=1}^n\big[y_i\log\pi_i + (1-y_i)\log(1-\pi_i)\big]$$

    $$g(\theta) = \frac{d}{d\theta}J(\theta) = \sum_{i=1}^n x_i(\pi_i - y_i) = X^T(\pi - y)$$

    $$H = \frac{d}{d\theta}g(\theta)^T = \sum_i \pi_i(1-\pi_i)\,x_i x_i^T = X^T\mathrm{diag}\big(\pi_i(1-\pi_i)\big)X = X^T S X,$$

    where $S = \mathrm{diag}\big(\pi_i(1-\pi_i)\big)$, $\pi_i = \mathrm{sigmoid}(x_i^T\theta)$

    One can show that $H$ is positive definite; hence the NLL is convex and has a unique global minimum. To find the minimum:

    1. Gradient descent

    2. Newton’s method

    In the WLS, the unknowns V and β arise together. And the basic idea of Iteratively Re-weighted Least Square is that you are going to update one given the other. The key is finding out the right formula for

    1. V that depends on β

    2. β that depends on V

    Then either one would be initialized and updated iteratively.
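Putting the gradient and Hessian together, Newton's method (equivalently, IRLS) for the logistic NLL can be sketched as follows (synthetic data and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 2
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])  # bias column
theta_true = np.array([0.5, 2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = (rng.random(n) < p).astype(float)

theta = np.zeros(3)
for _ in range(25):                             # Newton / IRLS iterations
    pi = 1.0 / (1.0 + np.exp(-X @ theta))
    g = X.T @ (pi - y)                          # gradient of the NLL: X^T (pi - y)
    S = pi * (1.0 - pi)
    H = X.T @ (X * S[:, None])                  # Hessian: X^T S X
    theta = theta - np.linalg.solve(H, g)       # Newton step

pi = 1.0 / (1.0 + np.exp(-X @ theta))
grad_norm = np.linalg.norm(X.T @ (pi - y))      # ~0 at the unique global minimum
```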

    image-20230519163609003

    Note: form the likelihood, take the negative log, and differentiate to get the gradient; then either follow the gradient (gradient descent) or follow the gradient weighted by the inverse Hessian (Newton's method).

    Bayesian logistic regression and MCMC

    Bayesian logistic regression

    $$p(y|X,\theta) = \prod_{i=1}^n \mathrm{Ber}\big(y_i\mid \mathrm{sigm}(x_i^T\theta)\big) = \prod_{i=1}^n \left[\frac{1}{1+e^{-x_i^T\theta}}\right]^{y_i}\left[1 - \frac{1}{1+e^{-x_i^T\theta}}\right]^{1-y_i}$$

    Posterior in Bayesian:

    $$p(\theta|D) = \frac{p(y|X,\theta)\,p(\theta)}{p(y|X)}, \qquad z = p(y|X) = \int p(y|X,\theta)\,p(\theta)\,d\theta$$

    This integral cannot be done by hand ($y$ is binary), so we work with $p(\theta|D) \propto p(y|X,\theta)\,p(\theta)$.

    And with Gaussian prior $p(\theta) = |2\pi\Sigma|^{-1/2}\,e^{-\frac{1}{2}(\theta-\mu)^T\Sigma^{-1}(\theta-\mu)}$

    $$z = \int p(y|\theta)\,p(\theta)\,d\theta$$

    Introduce the proposal $q(\theta) = N(0, 1000\,I)$:

    $$z = \int \frac{p(y|\theta)\,p(\theta)}{q(\theta)}\,q(\theta)\,d\theta$$

    Introduce the weight $\omega(\theta) = \dfrac{p(y|\theta)\,p(\theta)}{q(\theta)}$:

    $$z = \int \omega(\theta)\,q(\theta)\,d\theta, \qquad \theta^{(i)} \sim q(\theta),\ i = 1{:}N, \qquad z \approx \frac{1}{N}\sum_{i=1}^N \omega(\theta^{(i)})$$

    image-20230519165455741

    $$P(d\theta\mid \text{data}) = P(\theta\mid\text{data})\,d\theta \approx \frac{1}{N}\sum_{i=1}^N \omega(\theta^{(i)})\,\delta_{\theta^{(i)}}(d\theta)$$

    where $\delta_{\theta^{(i)}}(d\theta)$ counts the samples $\theta^{(i)}$ in the interval $d\theta$.

    the bayesian prediction function:

    with normalized weight (w)

    $$P(y_{t+1}\mid x_{t+1}, y_{1:t}, x_{1:t}) = \int P(y_{t+1}\mid x_{t+1},\theta)\,P(d\theta\mid x_{1:t}, y_{1:t}) = \int P(y_{t+1}\mid x_{t+1},\theta)\,P(\theta\mid x_{1:t}, y_{1:t})\,d\theta \approx \frac{1}{N}\sum_{i=1}^N \underbrace{P(y_{t+1}\mid x_{t+1},\theta^{(i)})}_{\text{likelihood}}\,\omega(\theta^{(i)})$$

    with un-normalized weight (w)

    $$P(\theta|D) = \frac{1}{z}P(D|\theta)\,P(\theta) = \frac{P(D|\theta)\,P(\theta)}{\int P(D|\theta)\,P(\theta)\,d\theta}$$

    $$P(y_{t+1}\mid x_{t+1}, D) = \int P(y_{t+1}\mid x_{t+1},\theta)\,P(\theta|D)\,d\theta = \frac{\int P(y_{t+1}\mid x_{t+1},\theta)\,\frac{P(D|\theta)P(\theta)}{q(\theta)}\,q(\theta)\,d\theta}{\int \frac{P(D|\theta)P(\theta)}{q(\theta)}\,q(\theta)\,d\theta} = \frac{\int P(y_{t+1}\mid x_{t+1},\theta)\,\omega(\theta)\,q(\theta)\,d\theta}{\int \omega(\theta)\,q(\theta)\,d\theta} \approx \frac{\frac{1}{N}\sum_{i=1}^N \omega(\theta^{(i)})\,P(y_{t+1}\mid x_{t+1},\theta^{(i)})}{\frac{1}{N}\sum_{i=1}^N \omega(\theta^{(i)})} = \sum_{i=1}^N \tilde\omega_i\,P(y_{t+1}\mid x_{t+1},\theta^{(i)}), \qquad \tilde\omega_i = \frac{\omega_i}{\sum_j \omega_j}$$

    Here $q(\theta) = N(\mu, \sigma^2 I)$; the hyperparameters of $q$ could be selected via cross-validation or Bayesian optimization.

    The target posterior is

    $$\Pi(\theta) = P(\theta\mid x_{1:t}, y_{1:t}) = \frac{1}{z}\,\tilde\Pi(\theta), \qquad \tilde\Pi(\theta) = \prod_{i=1}^t\big[\pi_i^{y_i}(1-\pi_i)^{1-y_i}\big]\,e^{-\frac{1}{2\delta^2}\theta^T\theta}, \qquad z = \int \tilde\Pi(\theta)\,d\theta$$

    want θ(i)P(θY1:t,X1:t)
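The self-normalized importance-sampling predictor above can be sketched end-to-end for a one-parameter logistic model (the prior variance $\delta^2$, the proposal variance, and the data are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 60
x = rng.standard_normal(n)
theta_true = 1.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-theta_true * x))).astype(float)

def loglik(th):
    # log p(y | x, theta), vectorized over a batch of theta samples
    eta = np.outer(th, x)
    return (y * eta - np.log1p(np.exp(eta))).sum(axis=1)

delta2 = 4.0                                     # prior variance
N = 20000
th = rng.normal(0.0, 3.0, N)                     # proposal q = N(0, 3^2)
log_prior = -th**2 / (2 * delta2) - 0.5 * np.log(2 * np.pi * delta2)
log_q = -th**2 / (2 * 9.0) - 0.5 * np.log(2 * np.pi * 9.0)
logw = loglik(th) + log_prior - log_q            # log of unnormalized weights
w = np.exp(logw - logw.max())                    # subtract max for stability
w_tilde = w / w.sum()                            # normalized weights w~_i

x_new = 0.5
p_pred = np.sum(w_tilde / (1.0 + np.exp(-th * x_new)))   # P(y=1 | x_new, D)
post_mean = np.sum(w_tilde * th)                 # posterior mean of theta
```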

    Predictive distribution,

    We have $D = (X_{1:n}, Y_{1:n})$ and a new input $X_{n+1}$; we want to predict $Y_{n+1}$. Bayesians believe the prediction should integrate out the effect of the parameters, done by marginalization:

    $$P(Y_{n+1}\mid X_{n+1}, D) = \int P(Y_{n+1}, \theta\mid X_{n+1}, D)\,d\theta = \int P(Y_{n+1}\mid\theta, X_{n+1}, D)\,P(\theta\mid X_{n+1}, D)\,d\theta$$

    $P(\theta\mid X_{n+1}, D)$ is the posterior over $\theta$, to which $X_{n+1}$ contributes nothing; $D$ can be dropped from the first factor because $\theta$ already carries the information about the data:

    $$= \int P(Y_{n+1}\mid\theta, X_{n+1})\,P(\theta|D)\,d\theta$$

    logistic function : P(Yn+1|θ, Xn+1)= πn+1yn+1(1πn+1)1yn+1

    [The Bayesian view: the prediction is given by the likelihood, with each prediction weighted by how probable it is; the probability is measured according to the posterior distribution, which is the quantity that takes into account the prior information and the training data.]

    With Monte Carlo method

    $$= \int P(Y_{n+1}\mid\theta, X_{n+1})\,P(\theta|D)\,d\theta \approx \frac{1}{n}\sum_{i=1}^n P(Y_{n+1}\mid\theta^{(i)}, X_{n+1}), \qquad \theta^{(i)} \sim P(\theta|D)$$

    Markov Chain Basic

    image-20230519213253623

    Xn state after n transitions

    $$p_{ij} = P(X_1 = j\mid X_0 = i) = P(X_{n+1} = j\mid X_n = i)$$

    Markov property/assumption: "given current state, the past doesn't matter"

    $$p_{ij} = P(X_{n+1} = j\mid X_n = i) = P(X_{n+1} = j\mid X_n = i, X_{n-1}, \ldots, X_0)$$

    $X_{n+1}$ is conditionally independent of $X_{n-1}, \ldots, X_0$ given $X_n$.

    image-20230520203000570

    image-20230520221040191

    image-20230520232437236

     

    image-20230521004115245

    Above gives one distribution answer to the question " what's the system state after certain time t"
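For a concrete answer to that question, propagate the initial distribution through powers of the transition matrix; for an ergodic chain this converges to the stationary distribution. The two-state chain below is made up for illustration:

```python
import numpy as np

# row-stochastic transition matrix: T[i, j] = p_ij
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])                      # start in state 0 with certainty

mu_t = mu0 @ np.linalg.matrix_power(T, 50)      # distribution after 50 transitions

# stationary distribution: left eigenvector of T with eigenvalue 1
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                              # normalize (fixes sign too)
```

For this chain the stationary distribution is $(5/6, 1/6)$, and $\mu_t$ has converged to it long before $t = 50$.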

    Monte Carlo method

    image-20230519162005088

    MCMC: Metropolise-Hastings

    ($\pi_j$ refers to the long-run frequency of being in state $j$.) In order to sample from a distribution $\pi(x)$, an MCMC algorithm constructs and simulates a Markov chain whose stationary distribution is $\pi(x)$, meaning that, after an initial "burn-in" phase, the states of that Markov chain are distributed according to $\pi(x)$. We thus just have to store the states to obtain samples from $\pi(x)$.

    For didactic purposes, let's for now consider both a discrete state space and discrete "time". The key quantity characterizing a Markov chain is the transition operator $T(x_{i+1}\mid x_i)$, which gives the probability of being in state $x_{i+1}$ at time $i+1$ given that the chain is in state $x_i$ at time $i$.

    In a discrete state space this is a transition matrix $T$; in a continuous one, $T$ is a transition kernel. We proceed while considering continuous distributions, but all concepts presented here transfer to the discrete case.

    If we could design the transition kernel in such a way that the next state is already drawn from π, we would be done, as our Markov chain would… well… immediately sample from π. Unfortunately, to do this, we need to be able to sample from π, which we can’t.

     

    1. Initialise $x^{(0)}$

    2. For $i = 0$ to $N-1$:

      • Sample $u \sim U[0,1]$.

      • Sample $x^* \sim q(x^*\mid x^{(i)})$, e.g. $x^* = x^{(i)} + N(0, 6^2)$

      • If $u < A(x^{(i)}, x^*) = \min\left\{1,\ \dfrac{p(x^*)\,q(x^{(i)}\mid x^*)}{p(x^{(i)})\,q(x^*\mid x^{(i)})}\right\}$:

        $x^{(i+1)} = x^*$

        else

        $x^{(i+1)} = x^{(i)}$
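The loop above, for a symmetric Gaussian random-walk proposal (so the $q$ ratio in $A$ cancels), targeting an unnormalized Gaussian density chosen here just for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)

def p_unnorm(x):
    # unnormalized target: proportional to N(2, 1.5^2)
    return np.exp(-0.5 * (x - 2.0) ** 2 / 1.5**2)

N = 50000
x = 0.0                                          # initialise x^(0)
samples = np.empty(N)
for i in range(N):
    x_star = x + rng.normal(0.0, 6.0)            # proposal: x* = x + N(0, 6^2)
    A = min(1.0, p_unnorm(x_star) / p_unnorm(x)) # symmetric q cancels in the ratio
    if rng.random() < A:                         # u ~ U[0,1]
        x = x_star
    samples[i] = x

burned = samples[5000:]                          # discard burn-in
```

After burn-in, the sample mean and standard deviation recover the target's parameters.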

     

    $$r(\theta_{new}, \theta_{t-1}) = \frac{\text{posterior probability of }\theta_{new}}{\text{posterior probability of }\theta_{t-1}} = \frac{p(D\mid\theta_{new})\,p(\theta_{new})}{p(D\mid\theta_{t-1})\,p(\theta_{t-1})} = \frac{\mathrm{Beta}(1,1,\theta_{new})\times\mathrm{Binomial}(10,4,\theta_{new})}{\mathrm{Beta}(1,1,\theta_{t-1})\times\mathrm{Binomial}(10,4,\theta_{t-1})}$$

     

    Geometric Series

    Geometric Series -- from Wolfram MathWorld

    A geometric series $\sum_k a_k$ is a series for which the ratio of each two consecutive terms $a_{k+1}/a_k$ is a constant function of the summation index $k$. The more general case of the ratio being a rational function of $k$ produces a series called a hypergeometric series. For the simplest case of the ratio $a_{k+1}/a_k = r$ equal to a constant $r$, the terms $a_k$ are of the form $a_k = a_0 r^k$. Letting $a_0 = 1$, the partial sum of the geometric sequence $\{a_k\}_{k=0}^n$ is given by

    $$S_n = \sum_{k=0}^n a_k = \sum_{k=0}^n r^k = 1 + r + r^2 + \cdots + r^n$$

    Multiplying both sides by r gives

    $$rS_n = r + r^2 + r^3 + \cdots + r^{n+1}$$

    and subtracting (3) from (2) then gives

    $$(1-r)S_n = (1 + r + r^2 + \cdots + r^n) - (r + r^2 + r^3 + \cdots + r^{n+1}) = 1 - r^{n+1}$$

    so

    $$S_n \equiv \sum_{k=0}^n r^k = \frac{1 - r^{n+1}}{1 - r}$$

    For $-1 < r < 1$, the sum converges as $n \to \infty$, in which case

    $$S \equiv S_\infty = \sum_{k=0}^\infty r^k = \frac{1}{1-r}$$

    Similarly, if the sums are taken starting at $k = 1$ instead of $k = 0$,

    $$\sum_{k=1}^n r^k = \frac{r(1 - r^n)}{1 - r}, \qquad \sum_{k=1}^\infty r^k = \frac{r}{1 - r}$$

    the latter of which is valid for |r|<1.
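The closed forms are easy to sanity-check numerically (the values of $r$ and $n$ below are arbitrary):

```python
# geometric series: compare brute-force partial sums against the closed forms
r, n = 0.5, 20
partial = sum(r**k for k in range(n + 1))        # S_n = sum_{k=0}^{n} r^k
closed = (1 - r**(n + 1)) / (1 - r)              # (1 - r^{n+1}) / (1 - r)

infinite = 1.0 / (1.0 - r)                       # limit as n -> infinity, |r| < 1
from_one = r / (1.0 - r)                         # same sum but starting at k = 1
```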

     

    Support Vector Machines (SVMs)

    Support Vector Machine (SVM) - Optimization objective

    An alternative view of logistic regression

     

    SVM cost functions from logistic regression cost functions

     

    The complete SVM cost function

    SVM notation is slightly different

     

    Large margin intuition

    Large margin classification mathematics (optional)

    Vector inner products 

    SVM decision boundary

     

     

    Kernels - 1: Adapting SVM to non-linear classifiers

    Diving deeper into the kernel

    What does σ do?

     

    Kernels II

    Choosing the landmarks

    SVM hypothesis prediction with kernels

    SVM training with kernels

     

    SVM - implementation and use

    Choosing a kernel

    Multi-class classification for SVM

    Logistic regression vs. SVM

    *θ is always at 90 degrees (orthogonal) to the decision boundary*

    https://stackoverflow.com/questions/10177330/why-is-weight-vector-orthogonal-to-decision-plane-in-neural-networks
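The claim is easy to verify numerically: pick any two points on the boundary $\theta^\top x + b = 0$; the vector joining them lies inside the boundary and is orthogonal to $\theta$ (the numbers below are arbitrary):

```python
import numpy as np

# decision boundary of a linear classifier: theta . x + b = 0
theta = np.array([3.0, 4.0])
b = -2.0

# two points on the boundary (each solves theta . x = -b)
p1 = np.array([0.0, -b / theta[1]])
p2 = np.array([-b / theta[0], 0.0])

direction = p2 - p1                              # a vector lying within the boundary
dot = float(theta @ direction)                   # zero: theta is orthogonal to it
```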